
Proceedings of IMPACT 2013


IMPACT 2013

Proceedings of the

3rd International Workshop on Polyhedral Compilation Techniques

Berlin, Germany, January 21, 2013

in conjunction with HiPEAC 2013

Workshop organizers and proceedings editors:

Armin Größlinger
Louis-Noël Pouchet


Each article in these proceedings is copyrighted by its respective authors.


Acknowledgements

The program chairs are grateful to the members of the program committee for their incredible work, time commitment and dedication during the review process for IMPACT 2013. We are also grateful to the authors who submitted to IMPACT, Albert Cohen for his keynote speech, and all participants and attendees of IMPACT 2013 in Berlin.

IMPACT 2013 in Berlin, Germany (in conjunction with HiPEAC 2013) is the third workshop in a series of international workshops on polyhedral compilation techniques. The previous workshops were held in Chamonix, France (2011) in conjunction with CGO 2011 and Paris, France (2012) in conjunction with HiPEAC 2012.


Contents

Committees 1

Tiling & Dependence Analysis

David G. Wonnacott, Michelle Mills Strout
On the Scalability of Loop Tiling Techniques . . . . . . 3

Tomofumi Yuki, Sanjay Rajopadhye
Memory Allocations for Tiled Uniform Dependence Programs . . . . . . 13

Sven Verdoolaege, Hristo Nikolov, Todor Stefanov
On Demand Parametric Array Dataflow Analysis . . . . . . 23

Parallelism Constructs & Speculation

Imèn Fassi, Philippe Clauss, Matthieu Kuhn, Yosr Slama
Multifor for Multicore . . . . . . 37

Dustin Feld, Thomas Soddemann, Michael Jünger, Sven Mallach
Facilitate SIMD-Code-Generation in the Polyhedral Model by Hardware-aware Automatic Code-Transformation . . . . . . 45

Johannes Doerfert, Clemens Hammacher, Kevin Streit, Sebastian Hack
SPolly: Speculative Optimizations in the Polyhedral Model . . . . . . 55


Organizers and Program Chairs
Armin Größlinger (University of Passau, Germany)
Louis-Noël Pouchet (University of California Los Angeles, USA)

Program Committee
Christophe Alias (ENS Lyon, France)
Cédric Bastoul (INRIA, France)
Uday Bondhugula (IISc, India)
Philippe Clauss (University of Strasbourg, France)
Albert Cohen (INRIA, France)
Alain Darte (ENS Lyon, France)
Paul Feautrier (ENS Lyon, France)
Martin Griebl (University of Passau, Germany)
Sebastian Hack (Saarland University, Germany)
François Irigoin (MINES ParisTech, France)
Paul Kelly (Imperial College London, UK)
Ronan Keryell (Wild Systems/Silkan, USA)
Vincent Loechner (University of Strasbourg, France)
Benoît Meister (Reservoir Labs, Inc., USA)
Sanjay Rajopadhye (Colorado State University, USA)
P. Sadayappan (Ohio State University, USA)
Michelle Mills Strout (Colorado State University, USA)
Nicolas Vasilache (Reservoir Labs, Inc., USA)
Sven Verdoolaege (KU Leuven/ENS, France)

Additional Reviewers
Somashekar Bhaskaracharya
Roshan Dathathri
Alexandra Jimborean
Athanasios Konstantinidis
Andreas Simbürger


On the Scalability of Loop Tiling Techniques

David G. Wonnacott
Haverford College
Haverford, PA, U.S.A.
[email protected]

Michelle Mills Strout
Colorado State University
Fort Collins, CO, U.S.A.
[email protected]

ABSTRACT

The Polyhedral model has proven to be a valuable tool for improving memory locality and exploiting parallelism for optimizing dense array codes. This model is expressive enough to describe transformations of imperfectly nested loops, and to capture a variety of program transformations, including many approaches to loop tiling. Tools such as the highly successful PLuTo automatic parallelizer have provided empirical confirmation of the success of polyhedral-based optimization, through experiments in which a number of benchmarks have been executed on machines with small- to medium-scale parallelism.

In anticipation of ever higher degrees of parallelism, we have explored the impact of various loop tiling strategies on the asymptotic degree of available parallelism. In our analysis, we consider "weak scaling" as described by Gustafson, i.e., in which the data set size grows linearly with the number of processors available. Some, but not all, of the approaches to tiling provide weak scaling. In particular, the tiling currently performed by PLuTo does not scale in this sense.

In this article, we review approaches to loop tiling in the published literature, focusing on both scalability and implementation status. We find that fully scalable tilings are not available in general-purpose tools, and call upon the polyhedral compilation community to focus on questions of asymptotic scalability. Finally, we identify ongoing work that may resolve this issue.

1. INTRODUCTION

The Polyhedral model has proven to be a valuable tool for improving memory locality and exploiting parallelism for optimizing dense array codes. This model is expressive enough to express a variety of program transformations, including many forms of loop tiling, which can improve cache line utilization and avoid false sharing [16, 37, 36], as well as increase the granularity of concurrency.

For many codes, the most dramatic locality improvements occur with time tiling, i.e., tiling that spans multiple iterations of an outer time-step loop. In some cases, the degree of locality can increase with the number of time steps in a tile, providing scalable locality [39]. For non-trivial examples, time tiling often requires loop skewing with respect to the time step loop [27, 39], often referred to as time skewing [39, 38]. This transformation typically involves imperfectly nested loops, and was thus not widely implemented before the adoption of the polyhedral approach. However, the PLuTo automatic parallelizer [19, 6] has demonstrated considerable success in obtaining high performance on machines with moderate degrees of parallelism by using this technique to automatically produce OpenMP parallel code.

Unfortunately, the specific tiling transformations that have been implemented and released in tools like PLuTo involve pipelined execution of tiles, which prevents full concurrency from the start. The lack of immediate full concurrency is sometimes dismissed as a start-up cost that will be trivially small for realistic problem sizes. While this may be true for the degrees of parallelism provided by current multi-core processors, this choice of tiling can impact the asymptotic degree of concurrency available if we try to scale up data set size and machine size together, as suggested by Gustafson [15]. Furthermore, Van der Wijngaart et al. [35] have modeled and experimentally demonstrated the load imbalance that occurs on distributed memory machines when using the pipelined approach.

In this paper, we review the status of implemented and proposed techniques for tiling dense array codes (including the important sub-case of stencil codes) in an attempt to determine whether or not the techniques that are currently being implemented are well suited to machines with higher demands for parallelism and control of memory traffic and communication. The published literature on tiling for automatic parallelization seems to be divided into two disjoint categories: "practical" papers describing implemented but unscalable techniques for automatic parallelizers for dense array codes, and "theoretical" papers describing techniques that scale well but are either not implemented or not integrated into a general automatic parallelizer.

In Section 2 of this paper, we show that the approach currently used by PLuTo does not allow full scaling as described by Gustafson [15]. In Section 4, we survey other tilings that have been suggested in the literature, classify each approach as fully scalable or not, and discuss its implementation status in current automatic parallelization tools. We also address recent work by Bondhugula et al. [3] on a tiling technique that we believe will be scalable, though asymptotic scaling is not addressed in [3]. Section 5 presents our conclusions: we believe the scalable/implemented dichotomy is an artifact of current design choices, not a fundamental limitation of the polyhedral model, and can thus be addressed via a shift in emphasis by the research community.


// update N pseudo-random seeds T times

// assumes R[ ] is initialized

for t = 1 to T

for i = 0 to N-1

S0: R[i] = (a*R[i]+c) % m

Figure 1: “Embarrassingly Parallel” Loop Nest.

2. TILING AND SCALABILITY

In his 1988 article "Reevaluating Amdahl's Law" [15], Gustafson observed that, in actual practice, "One does not take a fixed size problem and run it on various numbers of processors", but rather "expands [the problem] to make use of the increased facilities". In particular, in the successful parallelizations he described, "as a first approximation, the amount of work that can be done in parallel varies linearly with the number of processors", and it is "most realistic to assume run time, not problem size, is constant". This form of scaling is typically referred to as weak scaling or scalable parallelism, as opposed to the strong scaling needed to give speed-up proportional to the number of processors for a fixed-size problem.
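
For reference, Gustafson's scaled-speedup formula (stated here in its standard form; the formula itself is not quoted in the passage above) makes the same point symbolically: if s is the serial fraction of the constant run time measured on the parallel machine and P is the number of processors, then

    S_scaled(P) = s + (1 - s) P = P - (P - 1) s,

so the work completed in that fixed run time grows essentially linearly with P.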

Weak scaling can be found in many data parallel codes, in which many elements of a large array can be updated simultaneously. Figure 1 shows a trivial example that we will use to introduce our diagrammatic conventions (following [38]). In Figure 2 each statement execution/loop iteration is drawn as an individual node, with sample values given to symbolic parameters. The time axis, or outer loop, moves from left to right across the page. The grouping of nodes into tiles is illustrated with variously shaped boxes around sets of nodes. Arrows in the figure denote direction of flow of information among iterations or tiles. Line-less arrowheads indicate values that are live-in to the space being illustrated. (When comparing our figures to other work, note that presentation style may vary in several ways: some authors use a time axis that moves up or down the page; some draw data dependence arcs from a use to a definition, thus pointing into the data-flow like a weather vane; some illustrate tiles as rectangular grids on a visually transformed iteration space, rather than with varied shapes on the original iteration space.)

Figure 2 makes clear the possibility of both (weak) scalable parallelism and scalable locality. In the execution of a tile of size (τ × σ), i.e., τ iterations of the t (time) loop and σ iterations of the i (data) loop, data-flow does not prevent P processors from concurrently executing P such tiles. Each tile performs O(σ × τ) computations and has O(σ) live-in and live-out values; if each processor performs all updates of one data element before moving to the next, O(τ) operations can be performed with O(1) accesses to main memory.
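
As a concrete illustration of this ratio, the following sketch executes Figure 1 with each thread performing all time steps of one element before moving to the next; the OpenMP pragma and the treatment of the LCG parameters a, c, m as arguments are illustrative assumptions, not part of the original figure. Because the loop of Figure 1 carries no dependence across i, each element's entire time range can be kept in a register, giving O(T) operations per O(1) main-memory accesses.

#include <stdint.h>

/* Tiled execution of Figure 1 (a sketch).  Each thread owns a block of
 * seeds; for each seed, all T updates are performed in a register, so a
 * whole column of the iteration space costs one load and one store. */
void update_seeds_tiled(uint64_t *R, long N, long T,
                        uint64_t a, uint64_t c, uint64_t m) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {       /* data-parallel over seeds    */
        uint64_t r = R[i];               /* one load per element        */
        for (long t = 1; t <= T; t++)    /* all time steps in register  */
            r = (a * r + c) % m;
        R[i] = r;                        /* one store per element       */
    }
}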

Inter-iteration data-flow can constrain, or even prevent, scalable parallelism or scalable locality. For the code in Figure 1, we can scale up parallelism by increasing N and P, or we can scale up locality by increasing T with the machine balance. However, beyond a certain point (i.e., σ = 1), we can no longer use additional processors to explore ever increasing values of T for a given N. (An increase in CPU clock speed might help in this situation, though beyond a certain point it would likely not help performance for increasing N for a given T.)

Figure 2: Iteration Space of Figure 1 with T=5, N=8, tiled with τ = 5, σ = 4.

As we show in the next two sections, scalable parallelism may be constrained not only by the fundamental data-flow, but also by the approach to parallelization.

3. PIPELINED PARALLELISM

In a Jacobi stencil computation, each array element is updated as a function of its value and its neighbors' values, as shown (for a one-dimensional array) in Figure 3. Thus, we must perform time skewing to tile the iteration space (except in the degenerate case of τ = 1, which prevents scalable locality). Figure 4 illustrates the usual tiling performed by automatic parallelizers such as PLuTo, though for readability our figure shows far fewer loop iterations per tile. Nodes represent executions of statement S1; for simplicity, executions of S2 are not shown. (The same data-flow also arises from a doubly-nested execution of the single statement A[t%2,i] = (A[(t-1)%2,i-1]+2*A[(t-1)%2,i]+A[(t-1)%2,i+1])/4, but some tools may not recognize the program in this form.)
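
For concreteness, a minimal C rendering of that double-buffered form might look as follows (a sketch; the function name and the use of a variable-length array parameter are our own choices, not from the paper):

#include <stddef.h>

/* Double-buffered 3-point Jacobi: A[t%2][i] holds cell i at time step t.
 * Same data-flow as Figure 3, written as one doubly nested statement. */
void jacobi_rolling(size_t n, int T, double A[2][n]) {
    for (int t = 1; t <= T; t++)
        for (size_t i = 1; i + 1 < n; i++)
            A[t % 2][i] = (A[(t - 1) % 2][i - 1]
                           + 2.0 * A[(t - 1) % 2][i]
                           + A[(t - 1) % 2][i + 1]) / 4.0;
}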

Array data-flow analysis [11, 22, 23] is well understood for programs that fit the polyhedral model, and can be used to deduce the data-flow arcs from the original imperative code. The data-flow arcs crossing a tile boundary describe the communication between tiles; in most approaches to tiling for distributed systems, inter-processor communication is aggregated and takes place between executions of tiles, rather than in the middle of any tile. The topology of the inter-tile data-flow thus gives the constraints on possible concurrent execution of tiles.


for t = 1 to T

for i = 1 to N-2

S1: new[i] = (A[i-1]+2*A[i]+A[i+1])/4

for i = 1 to N-2

S2: A[i] = new[i]

Figure 3: Three Point Jacobi Stencil.

Figure 4: Iteration Space of Figure 3 with T=5, N=10, tiled with τ = 2, σ = 4.

For Figure 4, concurrent execution is possible in pipelined fashion, in which execution of tiles progresses as a wavefront that begins with the lower left tile, then simultaneously executes the two tiles bordering it (above and to the right), and continues to each wave of tiles adjacent to the just-completed wave.
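
The corresponding tile-level schedule is easy to express as a doubly nested loop over wavefronts (a sketch; execute_tile is a hypothetical stand-in for the generated tile body, and the OpenMP pragma is an illustrative choice):

void execute_tile(long tt, long ii);   /* hypothetical tile body */

/* Pipelined (wavefront) execution over a TT x II grid of tiles:
 * all tiles on anti-diagonal tt + ii == w are mutually independent,
 * matching the waves of Figures 4 and 5. */
void run_wavefronts(long TT, long II) {
    for (long w = 0; w <= TT + II - 2; w++) {   /* waves run in order  */
        #pragma omp parallel for
        for (long tt = 0; tt < TT; tt++) {      /* tiles within a wave */
            long ii = w - tt;
            if (ii >= 0 && ii < II)
                execute_tile(tt, ii);
        }
    }
}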

As has been noted in the literature, this is not the only way to tile this set of iterations; however, other tilings are not (currently) selected by fully-automatic loop parallelizers such as PLuTo [19]. Even the semi-automatic AlphaZ system [40], which is designed to allow programmers to experiment with different optimization strategies, cannot express many of these tilings. If such tools are to be considered for extreme scale computing, we must consider whether or not the tiling strategies they support provide the necessary scaling characteristics.

3.1 Scalability

To support our claim that this pipelined tiling does not always provide scalable parallelism, we need only show that it fails to scale on one of the classic examples for which it has shown dramatic success for low-degree parallelism, such as the easily-visualized one-dimensional Jacobi stencil of Figure 3.

Figure 5: Wavefronts of Pipelined Tile Execution.

We will first do so, and then discuss the issue in higher dimensions.

For some problem sizes, the tiling of Figure 4 can come close to realizing scalable parallelism: if P = N/(σ+τ) and T/τ is much larger than P, most of the execution is done with P processors. Figure 5 illustrates the first 8τ time steps of such an example, with P = 8, σ = 2τ, and N = P(σ + τ) (the ellipsis on the right indicates a large number of additional time steps). The tiles executed in the first 12 waves are numbered 1 to 12 for reference, and individual iterations and data-flow are omitted for clarity. For all iterations after 11, this tiling provides enough parallelism to keep eight processors busy for this problem size, and for large T the running time approaches 1/8 of the sequential execution time (plus communication/synchronization time, which we will discuss later). If we double both N and P, a similar argument shows the running time approaches 1/16 of the now twice-as-large sequential execution time, i.e., the same parallel execution time, as described by Gustafson.

However, as N and P continue to grow, the assumption that T/τ ≫ P eventually fails, and scalability is lost. Consider what happens in Figure 5 if T = 8τ, i.e., the ellipsis corresponds to 0 additional time steps. At this point, doubling N and P produces a figure that is twice as tall, but no wider; parallelism is limited to degree 8, and execution with 16 processors requires 35 steps rather than the 23 needed for Figure 5 when T = 8τ (note that the upper-right region is symmetric with the lower-left). Thus, communication-free execution time has increased rather than remaining constant. Increasing T (rather than N) with P is no better, and in fact no combination of N and T increase can allow 16 processors to execute twice the work of 8 in the same 23 wavefronts: adding even one full row or one full column of tiles means 24 wavefronts are needed.

Figure 5 was, of course, constructed to illustrate a lack of scalability. But even if we start with a more realistic problem size, i.e., with N ≫ P and T ≫ P, the pipelined tiling still limits scalability. Consider what happens when we apply the tiling of Figure 5 with parameters N = 10000σ, T = 1000τ, P = 100, in which over 99% of the tiles can be run with full 100-fold parallelism, and then scale up N and P together by successive factors of ten. Our first jump in size gives N = 100000σ, T = 1000τ, P = 1000, which still has full (1,000-fold) parallelism in over 98% of the tiles.

After the next jump, to N = 1000000σ, T = 1000τ, P = 10000, there is no 10,000-fold concurrency in the problem. Even if the application programmer is willing to scale up T rather than just N, in an attempt to reach full machine utilization, the execution for N = 38730σ, T = 25820τ, P = 10000 still achieves 10,000-fold parallelism in only 85% of the 10^9 tiles. No combination of N and T allows any 100,000-fold parallelism on 10^10 tiles with this tiling... to maintain a given degree of parallelism asymptotically, we must scale both N and T with P, contrary to Gustafson's original definition. While this may be acceptable for some application domains, we do not presume it to be universally appropriate, and thus see pipelined tiling as a potential restriction on the applicability of time tiling.
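
The first of these percentages is easy to reproduce with a few lines of code (a sketch; it models the skewed tile space of Figure 5 as a plain R × C rectangle of tiles executed along anti-diagonal wavefronts, which is only an approximation of the exact tile shapes):

#include <stdio.h>

/* Fraction of tiles in an R x C tile grid lying on anti-diagonal
 * wavefronts of width >= P.  Wavefront w (0-based) contains
 * min(w+1, R, C, R+C-1-w) tiles. */
double frac_full_parallel(long R, long C, long P) {
    long covered = 0;
    for (long w = 0; w < R + C - 1; w++) {
        long width = w + 1;
        if (width > R) width = R;
        if (width > C) width = C;
        if (width > R + C - 1 - w) width = R + C - 1 - w;
        if (width >= P) covered += width;
    }
    return (double)covered / ((double)R * (double)C);
}

int main(void) {
    /* Roughly the first scenario above: 1000 time tiles, 10000 space
     * tiles, 100 processors; prints about 0.999. */
    printf("%.4f\n", frac_full_parallel(1000, 10000, 100));
    return 0;
}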

It is not always realistic to scale the number of time steps T. Bassetti et al. [4] introduce an optimization they call sliding block temporal tiling. They indicate that in these relaxation algorithms such as Jacobi "[g]enerally several sweeps are made". In their experiments they use up to 8 sweeps. Zhou et al. [41] use 16,384 in their modified benchmarks. Experiments with the Pochoir compiler [34] used 200 time steps because their cache oblivious performance optimization improves temporal locality. In summary, the number of time iterations in stencil computation performance optimization research varies dramatically. Work from the BeBOP group at Berkeley [8] discusses how often multiple sweeps over the grid within one loop occur and indicates that it may not be as common as those of us working on time skewing imagine. This makes it even more important for tiling strategies to provide scalable parallelism that does not require the number of time steps to be on par with the spatial domain.

3.2 Tile Size Choice and Communication Cost

The above argument presumes a fixed tile size, ignoring the possibility of reducing the tile size to increase the number of tiles. However, communication costs (either among processors or between processors and RAM) dictate a minimal tile size below which performance will be negatively impacted (see [38, 19] for further discussion).

Note that per-processor communication cost is not likely to shrink as the data set size and number of processors is scaled up: Each processor will need to send the same number of tiles per iteration, producing per-processor communication cost that remains roughly constant (e.g., if processors are connected to nearest neighbors via a network of the same dimensionality as the data set space) or rising (e.g., if processors are connected via a shared bus).

Even if we ignore communication costs entirely (i.e., in the notoriously unscalable PRAM abstraction), tile size cannot shrink below a single-iteration (or single-instruction) tile, and eventually our argument of Section 3.1 comes into play in asymptotic analysis.

3.3 Other Factors Influencing Scalability and Performance

Our argument focuses on tile shape, but a number of other factors will influence the degree of parallelism actually achieved by a given tiling. As noted above, reductions in tile size could, up to a point, provide additional parallelism (possibly at the cost of performance on the individual nodes).

The use of global barriers or synchronization is common in the original code generators, but note that MPI code generators are under development for both PLuTo and AlphaZ. While combining different tilings and code generation schemes raises practical challenges in implementation, we do not see any reason why any of the tilings discussed in the next section could not, in principle, be executed without global barriers within the tiled iteration space.

High performance on each node also requires attention to a number of other code generation issues, such as code complexity and impact on vectorization and prefetching. Furthermore, these issues could be exacerbated by changes in tile shape [3, Section III.B]. Thus, different tiling shapes may be optimal for different hardware platforms or even different problem sizes, depending on the relative costs of limiting parallelism vs. per-node performance. Both [3, Section III.B] and [33] discuss these issues and the possibility of choosing a tiling that is scalable in some, but not all, data dimensions.

3.4 Higher-Dimensional Codes

While the two-dimensional iteration space of the three-point Jacobi stencil is easy to visualize on paper, many of the subtleties of tiling techniques are only evident in problems with at least two dimensions of data and one of time. For pipelined tiling, the constraint on scalability is essentially the same in higher dimensions: for a pipelined tiling of a hyper-rectangular iteration space of dimension d, eventually the amount of work must grow by O(k^d) to achieve parallelism O(k^(d-1)).

Conversely, in higher dimensions, the existence of a wavefront that is perpendicular to the time dimension (or any other face of a hyper-rectangular iteration space) is frequently the sign of a parallelization that admits some form of weak scalability. However, as we will see, the parallelism of some tilings scales with only some of the spatial dimensions.

4. VARIATIONS ON THE TILING THEME

The published literature describes many approaches to loop tiling. In this section, we survey these approaches, grouping together those that produce similar (or identical) tilings. Our descriptions focus primarily on the tiling that would be used for the code of Figure 3, which is used as an introductory example in many of the descriptions. We illustrate the tilings of this code with figures that are analogous to our Figure 5, with gray shading highlighting a single tile from time step two. We delve into the complexities of more complex codes only as necessary to make our point.

Figure 6: Overlapped Tiling.

For iterative codes with intra-time-step data-flow, the tile shapes discussed below are not legal (or cannot be legally executed in an order that permits scalable parallelism). For example, consider an in-place update of a single copy of a data set, e.g., the single statement A[i] = (A[i-1]+2*A[i]+A[i+1])/4 nested inside t and i loops. Since each tile must wait for data from tiles with lower values of i and the same value of t, pipelined startup is necessary. While such code provides additional asymptotic concurrency when both T and N increase, we see no way to allow concurrency to grow linearly with the total work to be done in parallel. Thus, pipelined tiling does not produce a scalability disadvantage for such codes.
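
Written out as a loop nest, this in-place form differs from the two-array code of Figure 3 only in that each update overwrites the single copy it reads from (a sketch; the function name is ours):

#include <stddef.h>

/* In-place sweep with intra-time-step data-flow: A[i] reads A[i-1]
 * already updated at the current time step t, so iterations within one
 * time step are ordered and tiles must start in pipelined fashion. */
void inplace_sweep(size_t n, int T, double *A) {
    for (int t = 1; t <= T; t++)
        for (size_t i = 1; i + 1 < n; i++)
            A[i] = (A[i - 1] + 2.0 * A[i] + A[i + 1]) / 4.0;
}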

Note that our discussion below focuses on distinct tile shapes, rather than distinctions among algorithms used to deduce tile shape or size or the manner in which individual tiles are scheduled or assigned to processors. For example, we do not specifically discuss the "CORALS" approach [32], in which an iteration space is recursively subdivided into parallelograms, avoiding the need to choose a tile size in advance of starting the computation. Regardless of size, variation in size, and algorithmic provenance, the information flow among atomic parallelogram tiles still forces execution to proceed along the diagonal wavefront, and thus still limits asymptotic scalability.

4.1 Overlapped Tiling

A number of projects have experimented with what is commonly called overlapped tiling [26, 4, 2, 25, 10, 24, 19, 7, 20, 41]. In overlapped tiling for stencil computations, a larger halo is maintained so that each processor can execute more than one time step before needing to communicate with other processors. Figure 6 illustrates this tiling. Two individual tiles from the second wavefront have been indicated with shading, one with gray and one with polkadots; the triangular polkadotted and gray region is in both tiles, and thus represents redundant computation. This overlap means that all tiles along each vertical wavefront can be executed in parallel while still improving temporal data locality.

Figure 7: Trapezoidal Tiling.

In terms of parallelism scalability, overlapped tiling does scale because all of the tiles can be executed in parallel. If a two-dimensional tiling in a two-dimensional spatial part of a stencil is used as the seed partition, then two dimensions of parallelism will be available with no need to fill a pipeline. This means that as the data scale, so will the parallelism.

The problem with overlapped tiling is that redundant computation is performed. This leads to a trade-off between parallel scalability and execution time. Tile size selection must also consider the effect of the expanded memory footprint caused by overlapped tiling.
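
The redundant work and the enlarged footprint are both visible in a minimal sketch of one overlapped tile for the 3-point Jacobi stencil (the function name, buffer layout, and the assumption that the caller handles global boundaries and write-back are ours):

#include <string.h>

/* One overlapped (halo) tile: the tile owns sigma cells starting at
 * `start`, but copies sigma + 2*tau cells so it can advance tau time
 * steps without communicating.  The halo cells it updates are redundant
 * work also performed by neighbouring tiles. */
void overlapped_tile(const double *A, double *out,
                     long start, long sigma, long tau) {
    long w = sigma + 2 * tau;              /* working-set width        */
    double buf[2][w];                      /* C99 VLA scratch buffers  */
    memcpy(buf[0], A + start - tau, w * sizeof(double));
    for (long t = 1; t <= tau; t++)
        for (long i = t; i < w - t; i++)   /* valid region shrinks     */
            buf[t % 2][i] = (buf[(t - 1) % 2][i - 1]
                             + 2.0 * buf[(t - 1) % 2][i]
                             + buf[(t - 1) % 2][i + 1]) / 4.0;
    memcpy(out, buf[tau % 2] + tau, sigma * sizeof(double));
}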

Auto-tuning between overlapped sparse tiling and non-overlapped sparse tiling [29, 30] for irregular iteration spaces has also been investigated by Demmel et al. [9] in the context of iterative sparse matrix computations, where the tiling is a run-time reordering transformation [28].

4.2 Trapezoidal Tiling

Frigo and Strumpen [12, 13, 14] propose an algorithm for limiting the asymptotic cache miss rate of "an idealized parallel machine" while providing scalable parallelism. Figure 7 illustrates that even a simplified version of their approach can enable weak scaling for the examples we discuss here (their full algorithm involves a variety of possible decomposition steps; our figure is based on Figure 4 of [14]). In our Figure 7, the collection of trapezoids marked "1" can start simultaneously; after these tiles complete, the mirror-image trapezoids that fill the spaces between them, marked "2", can all be executed; after these steps, a similar pair of sets of tiles "3" and "4" complete another τ time steps of computation, etc. For discussion of the actual transformation used by Frigo and Strumpen, and its asymptotic behavior, see [14].

Figure 8: One data dimension and the time dimension in a diamond tiling [33]. The diamond extends into a diamond tube in the second data dimension.

The limitation of this approach is not its scalability, but rather the challenge of implementing it in a general-purpose compiler. Tang et al. [34] have developed the Pochoir compiler, based on a variant of Frigo and Strumpen's techniques with a higher degree of asymptotic concurrency [34]. However, Pochoir handles a specialized language that allows only stencil computations. Tools like PLuTo handle a larger domain of dense array codes; it may be possible to generalize trapezoidal tiling to PLuTo's domain, but we know of no such work.

4.3 Diamond Tiling

Strzodka et al. [33, 31] present the CATS algorithm for creating diamond "tube" tiles in a 3-d iteration space. The diamonds occur in the time dimension and one data dimension, as in Figure 8. The tube aspect occurs because there is no tiling in the other space dimension. Each diamond tube can be executed in parallel with all other diamond tubes within a temporal row of diamond tubes. For example, in Figure 8 all diamonds labeled "1" can be executed in parallel, after which all diamonds labeled "2" can be executed in parallel, etc. Within each diamond tube, the CATS approach schedules another level of wavefront parallelism at the granularity of iteration points.

Figure 9: Molecular tiling.

Although Strzodka et al. [33] do not use diamond tiles for 1-d data/2-d iteration space, diamond tiles are parallel scalable within that context. They actually focus on the 2-d data/3-d iteration space, where asymptotically, diamond tiling only scales for one data dimension. The outermost level of parallelism over diamond tubes only scales with one dimension of data since the diamond tiling occurs across time and one data dimension. On page 2 of [33], Strzodka et al. explicitly state that their results are somewhat surprising in that asymptotically their behavior should not be as good as previously presented tiling approaches, but the performance they observe is excellent, probably due to the concurrent parallel startup that the diamond tiles provide.

Diamond tiling is a practical approach that can perform better than pipelined tiling approaches because it avoids the pipeline fill and drain issue. The diamond tube also has advantages in terms of intra-tile performance: fine-grained wavefront parallelism and leveraging pre-fetchers. The disadvantages of the diamond tiling approach are that it has not been expressed within a framework such as the polyhedral model (although it would be possible, just not with rectangular tiles); that the approach does not cleanly extend to higher dimensions of data (only one dimension of diamond tiles is possible, with other dimensions doing some form of pipelined or split tiling); and that the outermost level of parallelism can only scale with one data dimension.

4.4 Molecular Tiling

Wonnacott [38] described a tiling for stencils that allows true weak scaling for higher-dimensional stencils, performs no redundant work, and contains tiles that are all the same shape. However, Wonnacott's molecular tiles required mid-tile communication steps, as per Pugh and Rosser's iteration space slicing [21], as illustrated in Figure 9. Each tile first executes its send slice (labeled "S"), the set of iterations that produce values that will be needed by another currently-executing tile, and then sends those values; it then goes on to execute its compute slice ("C"), the set of iterations that require no information from any other currently-executing tile; finally, each tile receives incoming values and executes its receive slice ("R"), the set of iterations that require these data. In higher dimensions, Wonnacott discussed the possibility of extending the parallelograms into prisms (as diamonds are extended into diamond tubes in the diamond tiling), but also presented a multi-stage sequence of send and receive slices to provide full scalability.

Once again, a transformation with potential for true weak scaling remains unrealized due to implementation challenges. No implementation was ever released for iteration space slicing [21]. For the restricted case of stencil computations, these molecular tiles can be described without reference to iteration space slicing, but they make extensive use of modulo constraints supported by the Omega Library's code generation algorithms [18], and Omega has no direct facility for generating the required communication primitives.

The developers of PLuTo explored a similar split tiling [19] approach, and demonstrated improved performance over the pipelined tiling, but this approach was not used for the released implementation of PLuTo.

4.5 A New Hope

Recent work on the PLuTo system [3] has produced a tiling that we believe will address the issue of true scalability with data set size, though the authors frame their approach primarily in terms of "enabling concurrent start-up" rather than improving asymptotic scalability. For a one-dimensional data set, this approach is essentially the same as the "diamond tiling" of Section 4.3, but for higher-dimensional stencils it allows scalable parallelism in all data dimensions.

Although a Jacobi stencil on a two-dimensional data set has an "obvious" four-sided pyramid of dependences, a collection of four-sided pyramids (or a collection of octahedrons made from pairs of such pyramids) cannot cover all points, and thus does not make a regular tiling. The algorithm of [3] produces, instead, a collection of six-faced tiles that appear to be balanced on one corner (these figures are shown with the time dimension moving up the page). Various sculptures of balanced cubes may be helpful in visualizing this tiling; those with limited travel budgets may want to search the internet for images of the "Zabeel park cube sculpture". The approach of [3] manages to construct these corner-balanced solids in such a way that the three faces at the bottom of the tile enclose the data-flow.

Experiments with this approach [3] demonstrate improved results (vs. pipelined tiling) for current shared-memory systems up to 16 cores. The practicality of this approach on such a low degree of parallelism suggests that the per-node penalties discussed in our Section 3.3 are not prohibitively expensive.

While the algorithm is described in terms of stencils, and the authors only claim concurrent startup for stencils, it is implemented in a general automatic parallelizer (PLuTo). We believe it would be interesting to explore the full domain over which this tiling algorithm provides concurrent startup.

The authors of [3] do not discuss asymptotic complexity, but we hope that future collaborations could lead to a detailed theoretical and larger-scale empirical study of the scalability of this technique, using the distributed tile execution techniques of [1] or [5].

4.6 A Note on Implementation Challenges

The pipelined tile execution shown in Figures 4 and 5 is often chosen for ease of implementation in compilers based on the polyhedral model. Such compilers typically combine all iterations of all statements into one large iteration space; the pipelined tiling can then be seen as a simple linear transformation of this space, followed by a tiling with rectangular solids. This approach works well regardless of choice of software infrastructure within the polyhedral model.

The other transformations may be more sensitive to choice of software infrastructure, or the subtle use thereof. At this time, we do not have an exact list of which transformations can be expressed with which transformation and code generation libraries. We are working with tools that allow the direct control of polyhedral transformations from a text input, such as AlphaZ [40] and the Omega Calculator [17], in hopes of better understanding the expressiveness of these tools and the polyhedral libraries that underlie them.

5. CONCLUSIONS

Current work on general-purpose loop tiling exhibits a dichotomy between largely unimplemented explorations of asymptotically high degrees of parallelism and carefully tuned implementations that restrict or inhibit scalable parallelism. This appears to result from the challenge of general implementation of scalable approaches. The pipelined approach requires only a linear transformation of the iteration space followed by rectangular tiling, but does not provide true scalable parallelism. Diamond tiling scales with only one data dimension. Overlapped, trapezoidal, and molecular tiling each pose implementation challenges (due to redundant work, non-uniform tile shape/orientation, or non-atomic tiles, respectively).

We believe automatic parallelization for extreme scale computing will require a tuned implementation of a general technique that does not inhibit or restrict scalability; thus future work in this area must address scalability, generality, and quality of implementation. We are optimistic that ongoing work by Bondhugula et al. [3] may already provide an answer to this dilemma.

6. ACKNOWLEDGMENTS

This work was supported by NSF Grant CCF-0943455, by a Department of Energy Early Career Grant DE-SC0003956, and the CACHE Institute grant DE-SC04030.


7. REFERENCES

[1] Abdalkader, M., Burnette, I., Douglas, T., and Wonnacott, D. G. Distributed shared memory and compiler-induced scalable locality for scalable cluster performance. Cluster Computing and the Grid, IEEE International Symposium on 0 (2012), 688–689.
[2] Allen, G., Dramlitsch, T., Foster, I., Goodale, T., Karonis, N., Ripeanu, M., Seidel, E., and Toonen, B. The Cactus code: A problem solving environment for the grid. In Proceedings of the Ninth IEEE International Symposium on High Performance Distributed Computing (HPDC9) (Pittsburg, PA, USA, 2000).
[3] Bandishti, V., Pananilath, I., and Bondhugula, U. Tiling stencil computations to maximize parallelism. In Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (November 2012), SC '12, ACM Press.
[4] Bassetti, F., Davis, K., and Quinlan, D. Optimizing transformations of stencil operations for parallel object-oriented scientific frameworks on cache-based architectures. Lecture Notes in Computer Science 1505 (1998).
[5] Bondhugula, U. Compiling affine loop nests for distributed-memory parallel architectures. Preprint, 2012.
[6] Bondhugula, U., Hartono, A., and Ramanujam, J. A practical automatic polyhedral parallelizer and locality optimizer. In PLDI '08: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (2008).
[7] Christen, M., Schenk, O., Neufeld, E., Paulides, M., and Burkhart, H. Manycore stencil computations in hyperthermia applications. In Scientific Computing with Multicore and Accelerators, J. Dongarra, D. Bader, and J. Kurzak, Eds. CRC Press, 2010, pp. 255–277.
[8] Datta, K., Kamil, S., Williams, S., Oliker, L., Shalf, J., and Yelick, K. Optimizations and performance modeling of stencil computations on modern microprocessors. SIAM Review 51, 1 (2009), 129–159.
[9] Demmel, J., Hoemmen, M., Mohiyuddin, M., and Yelick, K. Avoiding communication in sparse matrix computations. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) (Los Alamitos, CA, USA, 2008), IEEE Computer Society.
[10] Ding, C., and He, Y. A ghost cell expansion method for reducing communications in solving PDE problems. In Proceedings of the ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2001), Supercomputing '01, ACM, pp. 50–50.
[11] Feautrier, P. Dataflow analysis of scalar and array references. International Journal of Parallel Programming 20, 1 (Feb. 1991), 23–53.
[12] Frigo, M., and Strumpen, V. Cache oblivious stencil computations. In Proceedings of the 19th Annual International Conference on Supercomputing (New York, NY, USA, 2005), ICS '05, ACM, pp. 361–366.
[13] Frigo, M., and Strumpen, V. The cache complexity of multithreaded cache oblivious algorithms. In Proceedings of the Eighteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2006), SPAA '06, ACM, pp. 271–280.
[14] Frigo, M., and Strumpen, V. The cache complexity of multithreaded cache oblivious algorithms. Theor. Comp. Sys. 45, 2 (June 2009), 203–233.
[15] Gustafson, J. L. Reevaluating Amdahl's law. Communications of the ACM 31, 5 (May 1988), 532–533.
[16] Irigoin, F., and Triolet, R. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on Principles of Programming Languages (1988), pp. 319–329.
[17] Kelly, W., Maslov, V., Pugh, W., Rosser, E., Shpeisman, T., and Wonnacott, D. The Omega Calculator and Library. Tech. rep., Dept. of Computer Science, University of Maryland, College Park, Apr. 1996.
[18] Kelly, W., Pugh, W., and Rosser, E. Code generation for multiple mappings. In The 5th Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia, Feb. 1995), pp. 332–341.
[19] Krishnamoorthy, S., Baskaran, M., Bondhugula, U., Ramanujam, J., Rountev, A., and Sadayappan, P. Effective automatic parallelization of stencil computations. In Proceedings of Programming Languages Design and Implementation (PLDI) (New York, NY, USA, 2007), vol. 42, ACM, pp. 235–244.
[20] Nguyen, A., Satish, N., Chhugani, J., Kim, C., and Dubey, P. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Washington, DC, USA, 2010), SC '10, IEEE Computer Society, pp. 1–13.
[21] Pugh, W., and Rosser, E. Iteration slicing for locality. In 12th International Workshop on Languages and Compilers for Parallel Computing (Aug. 1999).
[22] Pugh, W., and Wonnacott, D. Eliminating false data dependences using the Omega test. In SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, California, June 1992), pp. 140–151.
[23] Pugh, W., and Wonnacott, D. Constraint-based array dependence analysis. ACM Trans. on Programming Languages and Systems 20, 3 (May 1998), 635–678.
[24] Rastello, F., and Dauxois, T. Efficient tiling for an ODE discrete integration program: Redundant tasks instead of trapezoidal shaped-tiles. In Proceedings of the 16th International Parallel and Distributed Processing Symposium (Washington, DC, USA, 2002), IPDPS '02, IEEE Computer Society, pp. 138–.
[25] Ripeanu, M., Iamnitchi, A., and Foster, I. T. Cactus application: Performance predictions in grid environments. In Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing (London, UK, 2001), Euro-Par '01, Springer-Verlag, pp. 807–816.


[26] Sawdey, A., and O'Keefe, M. T. Program analysis of overlap area usage in self-similar parallel programs. In Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing (London, UK, 1998), LCPC '97, Springer-Verlag, pp. 79–93.
[27] Song, Y., and Li, Z. New tiling techniques to improve cache temporal locality. ACM SIGPLAN Notices (PLDI) 34, 5 (May 1999), 215–228.
[28] Strout, M. M., Carter, L., and Ferrante, J. Compile-time composition of run-time data and iteration reorderings. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (New York, NY, USA, June 2003), ACM.
[29] Strout, M. M., Carter, L., Ferrante, J., Freeman, J., and Kreaseck, B. Combining performance aspects of irregular Gauss-Seidel via sparse tiling. In Proceedings of the 15th Workshop on Languages and Compilers for Parallel Computing (LCPC) (Berlin / Heidelberg, July 2002), Springer.
[30] Strout, M. M., Carter, L., Ferrante, J., and Kreaseck, B. Sparse tiling for stationary iterative methods. International Journal of High Performance Computing Applications 18, 1 (February 2004), 95–114.
[31] Strzodka, R., Shaheen, M., and Pajak, D. Time skewing made simple. In PPOPP (2011), C. Cascaval and P.-C. Yew, Eds., ACM, pp. 295–296.
[32] Strzodka, R., Shaheen, M., Pajak, D., and Seidel, H.-P. Cache oblivious parallelograms in iterative stencil computations. In ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing (June 2010), ACM, pp. 49–59.
[33] Strzodka, R., Shaheen, M., Pajak, D., and Seidel, H.-P. Cache accurate time skewing in iterative stencil computations. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) (Taipei, Taiwan, September 2011), IEEE Computer Society, pp. 517–581.
[34] Tang, Y., Chowdhury, R. A., Kuszmaul, B. C., Luk, C.-K., and Leiserson, C. E. The Pochoir stencil compiler. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (New York, NY, USA, 2011), SPAA '11, ACM, pp. 117–128.
[35] Van der Wijngaart, R. F., Sarukkai, S. R., and Mehra, P. The effect of interrupts on software pipeline execution on message-passing architectures. In FCRC '96: Conference Proceedings of the 1996 International Conference on Supercomputing (Philadelphia, Pennsylvania, USA, May 25–28, 1996), ACM Press, pp. 189–196.
[36] Wolf, M. E., and Lam, M. S. A data locality optimizing algorithm. In Programming Language Design and Implementation (New York, NY, USA, 1991), ACM.
[37] Wolfe, M. J. More iteration space tiling. In Proceedings, Supercomputing '89 (Reno, Nevada, November 1989), ACM Press, pp. 655–664.
[38] Wonnacott, D. Using Time Skewing to eliminate idle time due to memory bandwidth and network limitations. In International Parallel and Distributed Processing Symposium (May 2000), IEEE.
[39] Wonnacott, D. Achieving scalable locality with Time Skewing. International Journal of Parallel Programming 30, 3 (June 2002), 181–221.
[40] Yuki, T., Basupalli, V., Gupta, G., Iooss, G., Kim, D., Pathan, T., Srinivasa, P., Zou, Y., and Rajopadhye, S. AlphaZ: A system for analysis, transformation, and code generation in the polyhedral equational model. Tech. Rep. CS-12-101, Colorado State University, 2012.
[41] Zhou, X., Giacalone, J.-P., Garzaran, M. J., Kuhn, R. H., Ni, Y., and Padua, D. Hierarchical overlapped tiling. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (New York, NY, USA, 2012), CGO '12, ACM, pp. 207–218.


Memory Allocations for Tiled Uniform Dependence Programs∗

Tomofumi Yuki
Colorado State University
Fort Collins, Colorado, U.S.A.
[email protected]

Sanjay Rajopadhye
Colorado State University
Fort Collins, Colorado, U.S.A.
[email protected]

ABSTRACT

In this paper, we develop a series of extensions to schedule-independent storage mapping using Quasi-Universal Occupancy Vectors (QUOVs) targeting tiled execution of polyhedral programs. By quasi-universality, we mean that we restrict the "universe" of the schedule to those that correspond to tiling. This provides the following benefits: (i) the shortest QUOVs may be shorter than the fully universal ones, (ii) the shortest QUOVs can be found without any search, and (iii) multi-statement programs can be handled. The resulting storage mapping is valid for tiled execution by any tile size.

1. INTRODUCTION

In this paper, we discuss storage mappings for tiled programs, especially for the case when tile sizes are not known at compile time. When the tile sizes are parameterized, most techniques for storage mappings [6, 13, 15, 20] cannot be used due to the non-affine nature of parameterized tiling. However, we cannot combine parametric tiling with memory re-allocation if we cannot find a legal allocation for all legal tile sizes.

One approach that can find storage mappings for parametrically tiled programs is the Schedule-Independent Storage Mapping proposed by Strout et al. [19]. For programs with uniform dependences, schedule-independent memory allocation finds storage mappings that are valid for any legal execution of the program, including tiling by any tile size. We present a series of extensions to schedule-independent mapping for finding legal and compact storage mappings for polyhedral programs with uniform dependences.

Schedule-Independent Storage Mapping is based on what are called Universal Occupancy Vectors (UOVs), that characterize when a value produced can safely be overwritten. As originally defined by Strout et al., UOVs are fully universal, where the resulting allocations are valid for any legal schedule. In this paper, we restrict the UOVs to smaller universes and exploit their properties to efficiently find good UOVs. In the remainder of this paper, we call such UOVs that are not fully universal Quasi-UOVs (QUOVs) to distinguish them from fully universal ones. Using QUOVs, we can find valid mappings for tiled execution by any tile size, but not necessarily valid for other legal schedules. This leads to more compact storage mappings for cases when fully universal allocation is an overkill.

∗This work was funded in part by the National Science Foundation, Award Numbers 1240991 and 0917319.

The restriction on the universality leads to the following:

• QUOVs may be shorter than the shortest UOV. Since the universe of possible schedules is restricted, valid storage mappings may be more compact. We use Manhattan distance as the length of UOVs as a cost measure when we describe optimality of a projective memory allocation.

• The shortest QUOV for tiled loop programs can be analytically found, and the dynamic programming algorithm presented by Strout et al. is no longer necessary.

• Imperfectly nested loops can be handled. The original method assumed single statement programs (and hence perfectly nested loop programs). We extend the method by taking statement orderings, often expressed in the polyhedral model as constant dimensions, into account. This is possible because we focus on tiled execution, where tiling applies the same schedule (except for ordering dimensions) to all statements.

The input to our analysis is a program in polyhedral representation, which can be obtained from loop programs through array data-flow analysis [7, 14]. Our methods can be used for loop programs in single assignment form, an alternative view used in some of the prior work [19, 20].

In addition, we present a method for isolating boundary cases based on index-set splitting [9]. UOV-based allocation assumes that every dependence is active at all points in the iteration space. In practice, programs have boundary cases where certain dependences are only valid at iteration space boundaries. We take advantage of the properties of QUOVs we develop to guide the splitting.

2. BACKGROUND

In this section, we present the background necessary for this paper. We first introduce the terminology used, and then present an overview of Universal Occupancy Vectors [19].

2.1 Polyhedral Representations

Polyhedral representations of programs primarily consist of statement domains and dependences. Statement domains represent the set of iteration points where a statement is executed. In polyhedral programs, such sets are described by a finite union of polyhedra.

In this paper, we focus on programs with uniform dependences, where the producer and the consumer differ by a constant shift. We characterize these dependences using a vector, which we call the data-flow vector, drawn from the producer to the consumer. For example, if a value produced by an iteration [i, j] is used by another iteration [i+1, j+2], the corresponding data-flow vector is [1, 2].

2.2 Schedules and Storage Mappings

The schedules are represented by affine mappings that map statement domains to a common dimensional space, where the lexicographic order denotes the order of execution. In the polyhedral literature, statement orderings are commonly expressed as constant dimensions in the schedule. Furthermore, one can represent arbitrary orderings of loops and statements by adding d + 1 additional dimensions [8], where d is the dimensionality of statement domains in the programs (note that for uniform dependence programs, all statement domains have the same number of dimensions). To make the presentation consistent, we assume such schedules are always used, resulting in a 2d + 1 dimensional schedule.

The storage mappings are often represented as a combination of affine mappings and dimension-wise modulo factors [13, 16, 19].
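
As an illustrative example of this representation (ours, not taken from the paper), a rolling two-row buffer for a two-dimensional iteration space with indices (t, i) combines the identity affine mapping with a modulo factor of 2 on the first dimension:

    (t, i) → (t mod 2, i),

which is the storage mapping behind the familiar A[t%2][i] double-buffering idiom for stencil codes.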

2.3 Tiling

Tiling is a well known loop transformation that was originally proposed as a locality optimization [12, 17, 18, 22]. It can also be used to extract coarser grained parallelism, by partitioning the iteration space to tiles (blocks) of computation, some of which may run in parallel [12, 17].

Legality of tiling is a well established concept defined over contiguous subsets of the schedule dimensions (in the range of the scheduling function, the scheduled space), also called bands [3]. These dimensions of the schedules are tilable, and are also known to be fully permutable [12].

The range spaces of the schedules given to statements in a program all refer to the common space, and thus have the same number of dimensions. Among these dimensions, a dimension is tilable if no dependence is violated (i.e., the producer is not scheduled after the consumer, but may be scheduled to the same time stamp) with a one-dimensional schedule using only the dimension in question. Then any contiguous subset of such dimensions forms a legal tilable band.

We call a subset of dimensions in an iteration space tilable if the identity schedule is tilable for the corresponding subset. The iteration space is fully tilable if all dimensions are tilable.

2.4 Universal Occupancy Vectors

A Universal Occupancy Vector (UOV) [19] is a vector that denotes the "distance" after which a value may safely be overwritten, in the following sense. When an iteration z is executed, its value is stored in some memory location. Another iteration z′ can reuse this same memory location if all iterations that use the value produced by z have been executed. When a storage mapping is such that z and z′ are mapped to the same memory location, the difference of these points, z′ − z, is called the occupancy vector.

A Universal Occupancy Vector is a specialization of such vectors, where all iterations k that use z are guaranteed to be executed before z′ in any legal schedule. Since most machine models assume that, within a single assignment statement, reads happen before writes in a given time step, z′ may also use the value produced by z.

Additionally, we introduce a notion of scoping to UOVs. We call a vector v an UOV with respect to a set of dependences I if the vector v satisfies the necessary property to be an UOV for a subset of all points k that use the value produced by z with one of the dependences in I.

Once the UOV is computed, the mapping that corresponds to the projection of the statement domain along the UOV is a legal storage mapping. If the UOV crosses more than one integer point, then an array that corresponds to a single projection is not sufficient. Instead, multiple arrays are used in turn, implemented using modulo factors. The necessary modulo factor is the GCD of the elements of the UOV.

The trivial UOV, a valid but possibly suboptimal UOV, is computed as follows (a code sketch follows the two steps below).

1. Construct the set of data-flow vectors corresponding to all the dependences that use the result of a statement.

2. Compute the sum of all data-flow vectors in the constructed set.
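
A direct transcription of these two steps, together with the GCD-based modulo factor mentioned at the end of Section 2.4, might look as follows (a sketch; the function and parameter names are ours):

#include <stdlib.h>

static long gcd(long a, long b) {
    a = labs(a); b = labs(b);
    while (b) { long r = a % b; a = b; b = r; }
    return a;
}

/* flows: nflow data-flow vectors, each of dimension dim.
 * On return, uov holds the trivial UOV (sum of the data-flow vectors)
 * and *modulo the GCD of its entries. */
void trivial_uov(long dim, long nflow, const long flows[nflow][dim],
                 long uov[dim], long *modulo) {
    for (long d = 0; d < dim; d++) uov[d] = 0;
    for (long k = 0; k < nflow; k++)
        for (long d = 0; d < dim; d++)
            uov[d] += flows[k][d];          /* step 2: sum of vectors */
    long g = 0;
    for (long d = 0; d < dim; d++) g = gcd(g, uov[d]);
    *modulo = g;                            /* GCD of the UOV entries */
}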

The above follows from a simple proposition shown below, and an observation that the data-flow vector of a dependence is a legal UOV with respect to that dependence.

Proposition 1 (Sum of UOVs). Let u and v be, respectively, the UOVs for two sets of dependences U and V. Then u + v is a legal UOV for U ∪ V.

Proof. The value produced at z is dead when z + u can legally be executed with respect to the dependences in U, and similarly for V at z + v. Since there is a path from z to z + u + v by following the edges z + u and z + v (in either order), the value produced at z is guaranteed to be used by all uses, z + u and z + v, when z + u + v can legally be executed.

The optimality of UOVs without any knowledge of size or shape of the iteration space is captured by the length of the UOV. However, the length that should be compared is not the Euclidean length, but the Manhattan distance. We discuss the optimality of UOV-based allocations and other projective allocations in Section 7.

3. OVERVIEW OF OUR APPROACH

UOV-based allocation gives legal mappings even for schedules that cannot be implemented as loops. For example, even a run-time work-stealing scheduler can use UOV-based allocation. However, this is obviously overkill if we only consider schedules that can be implemented as loops.

The important change in perspective is that we are not interested in schedule-independent storage mappings, although the concept of UOV is used. We are only interested in using UOV-based allocation in conjunction with tiling. Thus, our allocation is partially schedule-dependent. The overview of our storage mapping strategy is as follows:

1. Extract polyhedral representations of programs (array expansion).

2. Perform scheduling and apply the schedules as transformations to the iteration space2. After the transformation, a lexicographic scan of the resulting iteration space respects the schedules. The resulting space should be (partially) tilable to take advantage of our approach.

2This can be viewed as pre-processing to code generation [2].

3. Apply UOV-guided index-set splitting (Section 6). This step attempts to isolate boundaries of statement domains that negatively influence storage mappings.

4. Apply QUOV-based allocation (Section 4). Our proposed storage mapping based on extensions to UOVs is applied to each statement after the splitting. Although inter-statement sharing of arrays may be possible, such optimization is beyond the scope of this paper.

The order of presentation does not follow the above for two reasons. One is that the UOV-guided splitting is an optional step that can further optimize memory usage. In addition, splitting introduces multiple statements to the program, and requires our extension for handling multiple statements presented in Section 5.

4. QUOV-BASED ALLOCATION FOR TILED PROGRAMS

In this section, we present a series of formalisms to analytically find the shortest QUOV. We first develop a lemma that can eliminate dependences while constructing UOVs. We then apply the lemma to find the shortest QUOVs in different contexts.

4.1 Relevant Set of Dependences for UOV Construction

The trivial UOV, which also serves as the starting point for finding the optimal UOV, is found by taking the sum of the data-flow vectors of all dependences. However, this formulation may lead to significantly inefficient starting points. For example, if two dependences with data-flow vectors [1, 0] and [2, 0] exist, the former dependence may be ignored during UOV construction, since a legal UOV-based allocation using only the latter dependence is also guaranteed to be legal for the former dependence.

We may refine both the construction of the trivial UOV and the optimality algorithm by reducing the set of dependences considered during UOV construction. The optimality algorithm presented by Strout et al. [19] searches a space bounded by the length of the trivial UOV using dynamic programming. Therefore, reducing the number of dependences to consider will improve both the trivial UOV and the dynamic programming algorithm.

The main intuition is that if a dependence can be transitively expressed by another set of dependences, then it is the only dependence that needs to be considered. This is formalized in the following lemma.

Lemma 1 (Dependence Subsumption). If a dependence f can be expressed as a composition of dependences in a set G, where all dependences in G are used at least once in the composition, then a legal UOV with respect to f is also a legal UOV with respect to all elements of G.

Proof. Given a legal UOV with respect to a single dependence f, the value produced at z is preserved at least until z′, defined by f(z′) = z, can be executed. Let the dependences in G be denoted gx, 1 ≤ x ≤ |G|. Since composition of uniform functions is associative and commutative, there is always a function g∗, obtained by composing dependences in G, such that f = g∗ ◦ gx for each x. Thus, all points z′′ with gx(z′′) = z are executed before z′, for all x. Therefore, a legal UOV with respect to f is guaranteed to preserve the value produced at z until all points that directly depend on z through a dependence in the set G have been executed.

Finding such a composition can be implemented as an integer linear programming problem. The problem may also be viewed as determining whether a set of vectors is linearly dependent when restricted to positive combinations. The union of all sets G, called the subsumed dependences, found in the initial set of dependences can be ignored when constructing the UOV.
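For uniform dependences, the composition in Lemma 1 amounts to asking whether the data-flow vector of f is a sum of the data-flow vectors in G with strictly positive integer coefficients. The sketch below checks this by bounded enumeration rather than ILP; it only illustrates the condition, and the coefficient bound is a simplifying assumption of ours.

```python
from itertools import product

def subsumes(f, G, max_coeff=10):
    """Check whether data-flow vector f equals a positive integer
    combination of the vectors in G, using every vector at least once.
    A brute-force stand-in for the ILP formulation; max_coeff is an
    artificial search bound."""
    dims = len(f)
    for coeffs in product(range(1, max_coeff + 1), repeat=len(G)):
        combo = [sum(c * g[d] for c, g in zip(coeffs, G)) for d in range(dims)]
        if combo == list(f):
            return True
    return False

# f = [2, 1] is the composition of [1, 0] and [1, 1], so a UOV for f
# also works for both of them.
print(subsumes([2, 1], [[1, 0], [1, 1]]))  # True
```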

Applying Lemma 1 may significantly reduce the number of dependences to be considered. However, the trivial UOV of the remaining dependences may still not be the shortest UOV. For example, consider the data-flow vectors [1, 1], [1, −1], [1, 0]. Although the vectors are independent under positive combinations, the trivial UOV [3, 0] is clearly longer than another UOV, [2, 0]. Further reducing the set of dependences to consider requires a variation of Lemma 1 that allows f itself to be a composition of dependences. This leads to complex operations, and the dynamic programming algorithm for finding the optimal UOV by Strout et al. [19] may be a better alternative. Instead of finding the shortest UOV in the general case, we show that such a UOV can be found very efficiently for a specific context, namely tiling.

4.2 Finding the Shortest QUOV for Tiled Programs

When UOV-based allocation is used in the specific context of tiling, the shortest QUOV can be found analytically. If we know that the program is to be tiled, we can add dummy dependences to restrict the universality of the storage mapping while maintaining tilability. In addition, we may assume that the dependences are all non-positive (for the tilable dimensions) as a result of the pre-scheduling step that ensures the legality of tiling. For the remainder of this section, the "universe" of UOVs is one of the following restricted universes: fully tilable, fully sequential, or mixed sequential and tilable.

Note that the expectation is that "tilable" iteration spaces are tiled in a later phase. We analyze iteration spaces that will be tiled using QUOVs, and then apply tiling. We also assume that the iteration points are scanned in lexicographic order within a tile.

Theorem 1 (Shortest QUOV, Fully Tilable). Given a set of dependences I in a fully tilable space, the shortest QUOV u for tiled execution is the element-wise maxima of the data-flow vectors of all dependences in I.

Proof. Let the element-wise maxima of all data-flow vectors be the vector m, and let fm be a dependence with data-flow vector m. For the unit vectors ud in each of the d dimensions, we introduce dummy dependences fd with data-flow vector ud. Because these dependences have non-negative components, the resulting space is still tilable. For every dependence in I there exists a sequence of compositions with the dummy dependences that transitively expresses fm. Using Lemma 1, the only dependence to be considered in UOV construction can therefore be reduced to fm, which has the trivial UOV m.

[Figure 1 plot: axes i and j; legend: data-flow vectors, dummy vectors, element-wise maxima, bounds by maxima, points with Manhattan distance 4.]

Figure 1: Illustration of Theorem 1 for the set of dependences with data-flow vectors [2, 0], [2, 1], [1, 2], and [0, 2]. The element-wise maxima of the data-flow vectors correspond to the shortest UOV. The value produced by the bottom-left iteration is used by the destinations of the data-flow vectors. The data-flows induced by the dummy dependences guarantee that the iteration pointed to by the element-wise maxima is executed only after all iterations that depend on the bottom-left one. None of the other iterations with the same Manhattan distance (4) can be reached, since that would require backward data-flow along at least one of the axes.

It remains to show that no QUOV shorter than m exists for the set of dependences I. The shortest QUOV is defined by the closest3 point from z that can be reached from all uses of z by following the dependences. Since the choice of z does not matter, let us use the origin, ~0, to simplify our presentation. This allows us to use the data-flow vectors interchangeably with coordinate vectors.

Then the hyper-rectangle with diagonal m includes all of I, and every bound of the hyper-rectangle is touched by at least one dependence. Since all dependences in a fully tilable space are restricted to have non-negative data-flow vectors, no point within the hyper-rectangle can be reached by following dependences. Thus, m is the closest common point reachable from those that touch the bounds.

The theorem is illustrated in Figure 1, and is contrasted with the trivial UOV used by Strout et al. [19] in Figure 2.
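In code, the shortest QUOV of Theorem 1 is just an element-wise maximum. The sketch below applies it to the data-flow vectors of Figure 1; the function name is ours.

```python
def shortest_quov_fully_tilable(dataflow_vectors):
    """Shortest QUOV for a fully tilable space (Theorem 1):
    the element-wise maxima of the data-flow vectors."""
    return [max(col) for col in zip(*dataflow_vectors)]

# Data-flow vectors from Figure 1.
print(shortest_quov_fully_tilable([[2, 0], [2, 1], [1, 2], [0, 2]]))  # [2, 2]
```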

The basic idea of inserting dummy dependences to restrict the possible schedules can be used beyond tilable schedules. One important corollary for sequential execution is the following.

Corollary 1 (Sequential Execution). Given a set of dependences I in an n-dimensional space where a lexicographic scan of the space is a legal schedule, let m be the lexicographic maximum of the data-flow vectors of all dependences in I. Then the shortest QUOV u for lexicographic execution is either m or the vector [m1 + 1, 0, · · · , 0], where m1 is the first element of m.

Proof. For sequential execution, we may introduce dummy dependences to any lexicographically preceding point. Then a dependence with a data-flow vector whose first element is m1 can subsume other dependences with lower values in the first element, according to Lemma 1, by introducing appropriate dummy dependences.

3Shortest and closest are both in terms of Manhattan distance.

[Figure 2 plot: axes i and j; legend: data-flow vectors, trivial UOV, shortest QUOV, shortest UOV, dynamic programming search space.]

Figure 2: Comparison against the trivial UOV computed as proposed by Strout et al. [19] for the same set of dependences as in Figure 1. The trivial UOV is [5, 5], and its length becomes the radius bounding the search space for the dynamic programming algorithm proposed by Strout et al. [19]. In contrast, Theorem 1 gives the shortest QUOV by simply computing the element-wise maxima of the data-flow vectors. Furthermore, the shortest fully universal UOV is twice as long as the shortest QUOV, since the unit-length dummy dependences cannot be assumed. The search space is bounded by a sphere since Euclidean distance is used by Strout et al. [19], but it can be adapted to Manhattan distance.

For the remaining dependences there are two possibilities:

• We may use dummy dependences of the form [1, ∗, · · · , ∗] to let a dependence with data-flow vector [m1 + 1, 0, · · · , 0] subsume all remaining dependences.

• We may use m as the QUOV, following Theorem 1.

It is obvious that when the former option is used, [m1 + 1, 0, · · · , 0] is the shortest. The optimality of the latter case follows from Theorem 1. Thus, the shortest UOV is the shorter of these two options.

Note that m can be shorter than [m1 + 1, 0, · · · , 0] only when m = [m1, 0, · · · , 0]. In addition, although the above corollary can be applied to tilable iteration spaces, the tilability may be lost due to memory-based dependences introduced by the allocation.
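A possible implementation of Corollary 1, under the assumption that Manhattan length is the comparison criterion as discussed in Section 2.4, is sketched below; the function name is ours.

```python
def shortest_quov_sequential(dataflow_vectors):
    """Shortest QUOV for lexicographic (sequential) execution
    (Corollary 1): the shorter, in Manhattan length, of the
    lexicographic maximum m and [m1 + 1, 0, ..., 0]."""
    m = max(dataflow_vectors)              # lexicographic maximum
    alt = [m[0] + 1] + [0] * (len(m) - 1)  # [m1 + 1, 0, ..., 0]
    manhattan = lambda v: sum(abs(x) for x in v)
    return list(m) if manhattan(m) <= manhattan(alt) else alt

# [1, 2] vs [2, 0]: both have Manhattan length 3, so either is valid;
# this sketch returns the lexicographic maximum [1, 2].
print(shortest_quov_sequential([[0, 1], [1, 2]]))  # [1, 2]
```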

The allocation given by the above corollary is not schedule-independent at all. It is an analytical solution to the storage mapping of uniform dependence programs where the schedule is the lexicographic scan of the iteration space.

The following corollary can trivially be established by combining the above results.

Corollary 2 (Sequence of Tilable Spaces). Given a set of dependences I in a space where a subset of the dimensions is tilable, and a lexicographic scan is legal for the other dimensions, the shortest QUOV u for tiled execution of the tilable dimensions and sequential execution of the rest is the combination of the vectors computed for each contiguous subset of either sequential or tilable dimensions.

for (i=0:N)
  S1[i] = foo();
for (j=0:N)
  S2[j] = bar(S1[j]);

(a) When θS1 = (i → 0, i, 0) and θS2 = (j → 1, j, 0).

for (i=0:N)
  S1[i] = foo();
  S2[i] = bar(S1[i]);

(b) When θS1 = (i → 0, i, 0) and θS2 = (j → 0, j, 1).

Figure 3: Two possible schedules for statements S1 and S2. Note that statement S1 in Figure 3a requires O(N) memory, whereas it requires only a scalar in Figure 3b, although the code shown is still in single assignment.

Note that the above corollary only takes effect when there are sequential subsets with at least two contiguous dimensions. When a single sequential dimension is surrounded by tilable dimensions, its element-wise maxima and lexicographic maxima are equivalent.

Using the above, the shortest QUOV for sequential, tiled, or a hybrid combination can be computed very efficiently.
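A hedged sketch of Corollary 2, reusing the two helpers from the sketches above: the dimensions are partitioned into maximal contiguous runs labelled tilable or sequential, the corresponding rule is applied to the projected data-flow vectors of each run, and the results are concatenated. The run-labelling input and the helper names are ours.

```python
from itertools import groupby

def shortest_quov_hybrid(dataflow_vectors, kinds):
    """kinds[d] is 'tilable' or 'sequential' for each dimension d.
    Apply Theorem 1 to tilable runs and Corollary 1 to sequential
    runs of contiguous dimensions, then concatenate the results."""
    quov = []
    pos = 0
    for kind, run in groupby(kinds):
        width = len(list(run))
        sub = [v[pos:pos + width] for v in dataflow_vectors]
        if kind == 'tilable':
            quov += shortest_quov_fully_tilable(sub)
        else:
            quov += shortest_quov_sequential(sub)
        pos += width
    return quov
```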

5. HANDLING OF PROGRAMS WITH MULTIPLE STATEMENTS

In many programs, there are multiple statements depending on each other. The original formulation of UOVs is for single-statement programs [19]. In this section, we show that the concept can be adapted to multi-statement programs, given that we restrict the universe to tiled execution of the iteration space.

5.1 Limitations of UOV-based Allocation

Allocations based on UOVs have the strong property that they are valid for any legal schedule. Here, the schedule is not limited to affine schedules in the polyhedral model; time stamps can be assigned to operations arbitrarily, as long as they respect the dependences. The concept of UOV applies to reuse among writes to a single common space, and relies on the fact that every iteration writes to the same space (or array). In programs with multiple statements, different statements may write to different arrays, and hence UOVs cannot be directly used.

For example, consider a program with two statements S1 and S2:

• DS1 = {i | 0 ≤ i ≤ N}
• DS2 = {j | 0 ≤ j ≤ N}

where the dependence is such that iteration x of S1 must be executed before iteration x of S2.

Figure 3 illustrates two possible schedules and their implications on memory usage. Note that we do not discuss the storage mapping for S2, since its result is not used within the code fragment above. A fully universal UOV-based allocation would have to account for such variations of a schedule, but such an extension may not even make sense: when two statements are scheduled differently, the dependence between the two statements in the scheduled space may no longer be uniform.

When we apply the concept of UOV for tiling, we are no longer interested in arbitrary schedules. We first apply all non-tiling scheduling decisions before performing storage mapping. Therefore, the only change in the execution order comes from a tiling transformation viewed as a post-processing step, so the same "schedule" is applied to all statements involved. This allows us to extend the concept of UOVs to imperfectly nested loops.

5.2 Handling of Statement Ordering

When the ordering dimensions are represented as constant dimensions, the elements of the UOV require special handling. In the original formulation there is only one statement, and thus every iteration point writes to the same array. When multiple statements exist in a program, the iteration space of each statement is a subset of the combined space, and the statement iteration spaces are made disjoint by statement ordering dimensions. Thus, not all points in the common space correspond to a write, and this affects how the UOV is interpreted.

Consider the program in Figure 3b. The only dependence involving S1 has data-flow [0, 0, 1], and since it is the only dependence, it is the shortest UOV. Literal interpretation of this vector as a UOV means that the iteration z + [0, 0, 1] can safely overwrite the value produced by z. However, this interpretation does not make sense in the context of by-statement allocation, since statement S2 at z + [0, 0, 1] writes to a different array.

We handle multi-statement programs by removing d out of the d + 1 constant dimensions for the purpose of UOV-based analysis. We remove constant dimensions by transferring the information carried by these dimensions to others. Let v be the data-flow vector of a dependence in the scheduled space. We first apply the following rule to the vector v:

• For each constant dimension x > 0 where vx > 0, set vx−1 = max(vx−1, 1).

Once the above rule is applied, all the constant dimensions except for the first one are removed, to obtain a d + 1 dimensional data-flow vector. We repeatedly apply the above to all dependences in the program, resulting in a set of data-flow vectors with d + 1 dimensions.
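As an illustration, the sketch below applies this rule to a data-flow vector in a scheduled space where the d + 1 constant dimensions are assumed to interleave the d loop dimensions at even positions 0, 2, 4, . . . ; this layout assumption and the function name are ours.

```python
def remove_constant_dims(v):
    """Transfer information from constant (statement-ordering)
    dimensions to the preceding loop dimension, then drop every
    constant dimension except the first one. Constant dimensions
    are assumed to sit at even positions 0, 2, 4, ...."""
    v = list(v)
    for x in range(2, len(v), 2):          # constant dimensions x > 0
        if v[x] > 0:
            v[x - 1] = max(v[x - 1], 1)    # push into preceding loop dim
    return [v[0]] + [v[i] for i in range(1, len(v)) if i % 2 == 1]

# The S1 dependence of Figure 3b: [0, 0, 1] becomes [0, 1];
# the S1 dependence of Figure 3a: [1, 0, 0] becomes [1, 0].
print(remove_constant_dims([0, 0, 1]))  # [0, 1]
print(remove_constant_dims([1, 0, 0]))  # [1, 0]
```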

We justify the above rule as follows. When a constant dimension of the data-flow is greater than 0, it means that some textually later statement uses the produced value. With respect to this particular dependence, the memory location may be reused once this textually later statement has been executed. However, there is always exactly one iteration of a specific statement along a constant dimension, since it is an artificial dimension for statement ordering. Therefore, the earliest possible iteration that can overwrite the memory location in question is the next iteration of the immediately surrounding loop dimension.

For example, the only dependence of S1 in Figure 3b is [0, 0, 1]. The value produced by S1 at i is used by S2 at i, but only overwritten by another iteration of S1 at i + 1. Therefore, we transfer the dependence information to the preceding dimension.


for (t=0:T)
  for (i=0:N)
    A[i] = foo(A[i]); //S1
  for (i=1:N)
    A[i] = bar(A[i-1], A[i]); //S2

Figure 4: Example to illustrate the influence of boundary cases. Note that the value produced by S1 is last used by S2 in the same outer loop iteration, except when i = 0, in which case it is used again by S1 at [t + 1, 0].

Projecting a d + 1 dimensional domain along a vector does indeed produce a domain with d dimensions. This is because, in the presence of imperfect nests, d-dimensional storage may be required for d-dimensional statements even with uniform dependences. Consider the code in Figure 3a. The only use of S1 is by S2, with data-flow [1, 0, 0] in the scheduled space. Applying the rule described above removes the last dimension, yielding [1, 0] as the QUOV. Projecting the statement domain of S1 (in the scheduled space with d constant dimensions removed) along the vector [1, 0] gives a two-dimensional domain: {i, x | 0 ≤ i ≤ N ∧ x = 0}. Although the domain is two-dimensional, it is effectively one-dimensional because of the equality in the constant dimension.

It is also important to note the special case when the UOV takes the form [0, · · · , 0, 1]. When the last constant dimension is the only non-zero entry, the statement clearly requires only a scalar, since its value is immediately consumed.

6. UOV-GUIDED INDEX SET SPLITTING

In the polyhedral representation of programs there are usually boundary cases that behave differently from the rest. For instance, the first iteration of a loop may read from inputs, whereas successive iterations use values computed by previous iterations.

In the polyhedral model, a storage mapping is usually computed for each statement. With pseudo-projective allocations, the same allocation must be used for all points in the statement domain. Thus, dependences that exist only at the boundaries influence the entire allocation.

For example, consider the code in Figure 4. The value produced by S1 at [t, i] is last used by S2 at [t, i + 1] for i > 0. However, the value produced by S1 at [t, 0] is last used by S1 at [t + 1, 0]. Thus, the storage mapping for S1 must ensure that a value produced at [t, i] is live until [t + 1, i] for all instances of S1. This clearly leads to wasteful allocations, and our goal is to avoid them.

One solution to the problem is to apply a form of index set splitting [9] such that the boundary cases and the common cases have different mappings. In the example above, we wish to use piece-wise mappings for S1 at [t, i], where the mapping is different for the two disjoint sets i = 0 and i > 0. This reduces the memory usage from an array of size N + 1 to two scalars.

Once the pieces are computed, piece-wise mappings can be applied by splitting the statements (nodes) as defined by the pieces, and then applying a separate mapping to each of the statements after the split. Thus, the only interesting problem that remains is finding meaningful splits.

In this section, we present an algorithm for finding the

split with the goal of minimizing memory usage. The algorithm is guided by Universal Occupancy Vectors and works best with UOV-based allocations. The goal of our index set splitting is to isolate boundaries that require a longer lifetime than the main body. Thus, we are interested in sub-domains of a statement with a dependence pattern different from that in the rest of the statement's domain. We focus on boundary domains that contain at least one equality. The approach may be generalized to boundary planes of constant thickness using Thick Face Lattices [11].

The original index-set splitting [9] aimed at finding better schedules. The quality of a split is measured by its influence on possible schedules: whether different scheduling functions for each piece yield a better schedule.

In our case, the goal is different. Our starting point is the program after affine scheduling, and we are now interested in finding storage mappings. When the dependence pattern is the same at all points in the domain, splitting cannot improve the quality of the storage mapping. Since the dependence pattern is the same, the same storage mapping will be used for all pieces (with the possible exception of cases where the split introduces equalities or other properties related to the shape of the domains). Moreover, because two points that may have been in the nullspace of the projection may now be split into different pieces, the number of points that can share the same memory location may be reduced as a result of splitting.

Thus, as a general measure of quality, we seek to ensure that a split influences the choice of storage mapping for each piece. The obvious case where splitting is useless is when a dependence function at a boundary is also present in the main part. We present Algorithm 1, based on this intuition, to reduce the number of splits.

The intuition of the algorithm is that we start with all dependences with equalities in their domain as candidate pieces. Then we remove from the candidate pieces those dependences for which splitting does not improve the allocation. The obvious case is when the same dependence function also exists in the non-boundary cases (i.e., dependences with no equalities in their domain). In addition, a more sophisticated exclusion is performed using Theorem 1.

The algorithm can also easily be adapted for non-uniform programs. It may also be specialized or generalized by adding more dependence elimination rules to Step 2. This requires a method similar to Lemma 1 for the other memory allocation methods.

Example

Let us describe the algorithm in more detail with an example. The statement S1 in Figure 4 has three data-flow dependences:

• I1 = S1[t, i]→ S2[t, i] when 0 ≤ i ≤ N

• I2 = S1[t, i]→ S2[t, i+ 1] when i < N

• I3 = S1[t, i]→ S1[t+ 1, i] when i = 0

The need for index-set splitting does not arise until some prior scheduling fuses the two inner loops. Let the scheduling functions be:

• θS1 = (t, i→ 0, t, 0, i, 0)

• θS2 = (t, i→ 0, t, 0, i, 1)


Algorithm 1 UOV-Guided Split

Input:
I: Set of uniform dependences that depend on a statement S. A dependence is a pair 〈f, D〉 where f is the dependence function and D is a domain. The domain is the constraints on the producer statement.

Output:
P: A partition of DS, the domain of S, where each element defines a piece of the split.

Algorithm:
We first inspect the domains of the dependences in I to detect equalities. Let
  Ib be the set of dependences with equalities, and
  Im be the set of those without equalities.
Then,

1. foreach 〈f, D〉 ∈ Ib,
   if ∃〈g, E〉 ∈ Im; f = g then remove 〈f, D〉 from Ib

2. Further remove dependences from Ib using the following, if applicable:
   (a) Theorem 1 and its corollaries. The following steps are only for Theorem 1.
       Let m be the element-wise maxima of the data-flow vectors in Im.
       foreach 〈f, D〉 ∈ Ib
         – Let v be the data-flow vector of f.
         – if ∀i : vi ≤ mi then remove 〈f, D〉 from Ib.

3. Group the remaining dependences in Ib into groups Gi, 0 ≤ i ≤ n, where ∀X, Y ∈ Gi: DX ∩ DY ≠ ∅. In other words, group the dependences with overlapping domains in the producer space.

4. foreach i ∈ 0 ≤ i ≤ n, Pi = ⋃X∈Gi DX

5. if n ≥ 0 then Pn+1 = DS \ (P0 ∪ · · · ∪ Pn), else P0 = DS
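A condensed sketch of Steps 1 and 2a, operating on dependences represented as (data-flow vector, has_equality) pairs; this simplified representation and the function name are ours, and the grouping and partitioning steps are omitted.

```python
def uov_guided_candidates(deps):
    """deps: list of (dataflow_vector, has_equality) pairs.
    Return the boundary dependences that remain candidates for
    splitting after Steps 1 and 2a of Algorithm 1."""
    Ib = [v for v, eq in deps if eq]        # boundary dependences
    Im = [v for v, eq in deps if not eq]    # main-body dependences
    # Step 1: drop boundary dependences whose function also occurs
    # in the main body.
    Ib = [v for v in Ib if v not in Im]
    # Step 2a: drop boundary dependences dominated element-wise by
    # the maxima of the main-body data-flow vectors (Theorem 1).
    if Im:
        m = [max(col) for col in zip(*Im)]
        Ib = [v for v in Ib if not all(vi <= mi for vi, mi in zip(v, m))]
    return Ib

# The example of this section: main-body data-flows [0, 1] (twice) and
# boundary data-flow [1, 0]; the boundary dependence survives.
print(uov_guided_candidates([([0, 1], False), ([0, 1], False), ([1, 0], True)]))
```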

The data-flow vectors in the scheduled space, after removing the constant dimensions (we also remove the outermost one since there is only one loop at the outermost level), are:

• I′1 = [0, 1] when 0 ≤ i ≤ N

• I′2 = [0, 1] when i < N

• I′3 = [1, 0] when i = 0

As the pre-processing step, we separate the dependences into two sets based on the equalities in their domains:

• Ib = {I′3}; those with equalities, and

• Im = {I′1, I′2}; those without equalities.

Then we use Step 1 to eliminate identical dependences. If the same dependence function occurs both at the boundary and in the main body, separating the boundary does not reduce the number of distinct dependences to be considered. Therefore, splitting such a domain does not positively influence the storage mapping, and hence it is removed from further consideration. Since I′3 is different from the other two, this step does not change the set of dependences for this example.

Step 2a is the core of this algorithm. The general idea is the same as in Step 1, but we use additional properties of UOV-based allocation to identify more cases where splitting does not positively influence the storage mapping.

Theorem 1 states that if all elements of a data-flow vector are no greater than the corresponding elements of the element-wise maxima of all data-flow vectors under consideration, the dependence does not influence the shortest QUOV. Therefore, if the candidate dependence to split does not contribute to the element-wise maxima, the split is useless in terms of further shortening the QUOV. As an illustration, consider the rectangle in Figure 1 defined by the element-wise maxima. If separating a dependence does not shrink this rectangle, the length of the QUOV cannot be shortened.

In this example, the element-wise maxima of the dependences in Im is [0, 1]. However, the boundary dependence has data-flow vector [1, 0], and when it is included, the element-wise maxima become [1, 1]. Therefore, the boundary dependence does contribute to the element-wise maxima, and it is not removed from the candidate set in Step 2a. The dependences left in the set Ib after this step are the dependences that will be split from the main body.

Step 3 is a grouping step, where the dependences to be split are grouped according to overlaps in their domains.


Two overlapping sub-domains of a statement cannot be split into separate statements unless the computation is duplicated. Although computing the values redundantly may be an option, we force such overlapping sub-domains to be jointly split off as one additional statement. This step is irrelevant for our example, since it only has one dependence in the set Ib.

The last step is a cleanup step, which adds the remainder of the statement domain to the set of partitions.

Let S1b be the statement obtained by splitting the domain of I3 off from S1, and S1a be the remainder, the main body of S1. The domains of the statements in the program are now:

• DS1a = {t, i | 0 ≤ t ≤ T ∧ 1 ≤ i ≤ N}
• DS1b = {t, i | 0 ≤ t ≤ T ∧ i = 0}
• DS2 = {t, i | 0 ≤ t ≤ T ∧ 1 ≤ i ≤ N}

and the dependences are:

• I∗1 = S1a[t, i] → S2[t, i] when 1 ≤ i ≤ N
• I∗2 = S1a[t, i] → S2[t, i + 1] when i < N

• I∗3 = S1b[t, i]→ S1a[t+ 1, i] when i = 0

The QUOV for S1a is [0, 0, 0, 0, 1] in the scheduled space with constant dimensions, which is the aforementioned special case, so only a scalar is required for S1a. The QUOV for S1b is [1, 0] after removing the constant dimensions. The projection of its domain along this vector is also a constant, due to the equality in its domain. Thus, the storage requirement for S1 in the original program becomes two scalars, which is much smaller than what is required without the splitting.

7. RELATED WORK

There is a large body of prior work on storage mappings for polyhedral programs [1, 4, 5, 13, 15, 19, 20, 21]. Most approaches focus on the case where the schedules for the statements are given.

7.1 Efficiency of UOV-based Allocation

By the nature of its strategy, UOV-based allocation, including allocation using QUOVs, cannot yield more compact storage mappings than alternative strategies tailored to a specific schedule. However, UOV-based allocation may not be as inefficient as one might think for programs that require d − 1 dimensional storage. The misconception is (at least partially) due to the often overlooked trade-off between memory usage and parallelism. Consider the following code fragment with Smith-Waterman(-like) dependences.

for (i=1:N)
  for (j=1:M)
    H[i,j] = foo(H[i-1,j], H[i,j-1]);

As illustrated in Figure 5, UOV-based allocation, even with QUOVs, gives a storage mapping that uses O(N + M) memory for this program. However, the program cannot be tiled if O(N) or O(M) storage mappings are used, due to memory-based dependences. One approach to still accomplish tiled execution of this program is to transform the program such that the iteration space is tilable even when the memory-based dependences are taken into consideration.

For the above program, this requires a skewing, as depicted in Figure 6.

[Figure 5: two panels over axes i and j — (a) iteration space and possible storage mappings; (b) set of live values in tiled execution.]

Figure 5: Iteration space of Smith-Waterman(-like) code and possible storage mappings. Figure 5a shows two possible storage mappings, either a horizontal or a vertical projection of the iteration space. The size of memory is O(M) or O(N), depending on the direction of the projection. Figure 5b shows the iteration space after tiling, and the values that are live after executing two tiles on the diagonal. Observe that neither projection (horizontal or vertical) can store all the live values. One example of a valid projective storage mapping is the projection along [1, 1], which uses O(N + M) memory.

Once the skewing is applied so that the O(M) storage mapping can be tiled, the UOV-based allocation for the same program will also yield a storage mapping with O(M) memory. Note that the skewed iteration space has less parallelism (a longer critical path) than the original rectangular iteration space. Parallelism is effectively traded off for decreased memory usage.

For this example, when the other factor (the amount of parallelism) is equal, the allocation using UOVs is no worse than what is considered a more efficient allocation. This observation can be generalized to other instances of uniform dependence programs, such as Jacobi/Gauss-Seidel stencils. How to jointly find schedules and storage mappings to explore such trade-offs is still an open problem.

7.2 Optimality of Projective Allocations

There is also an upper bound on dimension-wise optimality. Quillere and Rajopadhye [15] show that the number of linearly independent projection vectors can be viewed as the primary criterion for the optimality of storage mappings.

UOV-based allocation, as originally defined, was by its nature limited to allocations with one projection vector, and is therefore limited to finding d − 1 dimensional storage for a d dimensional iteration space. The additional optimizations we describe in Section 5 allow us to overcome this limitation, but for many, if not most, uniform dependence programs, the lower bound on the number of memory dimensions is d − 1. Therefore, UOV-based allocations are no more than a constant factor more expensive.

Now, this constant factor can become important, but Quillere and Rajopadhye [15] give a number of examples to illustrate that the problem is subtle when the size of the iteration domain is parameterized. For some programs the Lefebvre-Feautrier method [13] gives a better memory footprint than the Quillere-Rajopadhye method, while for others it is worse.

It is easy to show that any multiple, by some integer greater than one, of a legal UOV uses more memory. Two UOVs that are not constant multiples of each other are often difficult to compare.


for (i=1:N)
  for (j=i+1:M+i)
    x = j-i;
    H[i,x] = foo(H[i-1,x], H[i,x-1]);

(a) Code

(b) Iteration space (plotted over axes i and j)

Figure 6: The code and iteration space after skewing the original program. It is easy to see that [1, 1] is the shortest UOV for this program. With this UOV, the amount of memory used is O(M).

For example, the memory usage of two allocations based on the UOVs [1, 1] and [2, 0] is only parametrically comparable. With an N × M iteration space, the former uses N + M and the latter uses 2N. The optimal allocation in such a case depends on the values of N and M, which are not known until run-time.

Informally, increasing the Manhattan distance will always increase memory usage, either by increasing the GCD, and hence the modulo factor, or by increasing the angle of the projection, and hence the size of the projected space.

7.3 Parametric Tile Sizes

Our main motivation for QUOVs is to find storage mappings that are valid for tiled execution with any (legal) tile sizes. Schedule-dependent approaches cannot provide storage mappings for parametric tile sizes due to the non-affine nature of such tilings.

It is possible to provide tile coordinates and tile sizes as additional parameters to the polyhedral representation, and then apply polyhedral storage mappings to each tile individually. Although this approach makes sense in certain contexts (e.g., [10]), it is not suitable for others (e.g., shared memory parallelization). As shown in Figure 6, UOV-based allocation maps iterations from different tiles to a single memory location, allowing inter-tile reuse of storage. There is no need to transfer data from one tile to another in UOV-based allocation.

7.4 Affine Occupancy Vectors

Thies et al. [20] present an extension of the concept of UOVs to affine schedules, named Affine Occupancy Vectors. They restrict the universality to affine scheduling, rather than the full universe. Although the idea of restricting the universality has some similarities with our work, the restricted universe is still the entire affine scheduling space. In addition, they only handle one-dimensional affine schedules.

8. CONCLUSIONS

We have presented a series of extensions to Schedule-Independent Storage Mapping. Although UOVs were originally used for schedule-independent mappings, our extensions restrict the universality of the occupancy vectors to analyze a specific class of schedules: tiling.

For such a restricted universe, Quasi-UOVs can be shorter than fully universal ones, leading to more compact memory. We can also take advantage of the properties of this universe to directly find the shortest QUOV.

Although UOV-based allocations are limited to uniform dependence programs, storage mappings that are legal for a class of schedules are an interesting alternative to most memory allocation methods, which require the schedules to be given.

Our extensions aim to make UOV-based allocations more practical by providing an efficient method for finding the shortest UOV for a smaller but important universe: tilable programs.

9. REFERENCES

[1] C. Alias, F. Baray, and A. Darte. Bee+Cl@k: an implementation of lattice-based array contraction in the source-to-source translator ROSE. In Proceedings of the 2007 ACM SIGPLAN/SIGBED Conference on Language, Compiler and Tool Support for Embedded Systems, volume 13, pages 73–82, 2007.

[2] C. Bastoul. Code generation in the polyhedral modelis easier than you think. In Proceedings of the 13thIEEE International Conference on ParallelArchitecture and Compilation Techniques, PACT ’04,pages 7–16, Washington, DC, USA, 2004.

[3] U. Bondhugula, A. Hartono, J. Ramanujam, andP. Sadayappan. A practical automatic polyhedralparallelizer and locality optimizer. In Proceedings ofthe 29th ACM SIGPLAN Conference on ProgrammingLanguage Design and Implementation, PLDI ’08,pages 101–113, New York, NY, USA, 2008. ACM.

[4] Y. Bouchebaba and F. Coelho. Tiling and memoryreuse for sequences of nested loops. In Proceedings ofthe 8th International Euro-Par Conference, volume2400, page 255, 2002.

[5] A. Cohen. Parallelization via constrained storagemapping optimization. In Proceedings of theInternational Symposium on High PerformanceComputing, pages 83–94, 1999.

[6] A. Darte, R. Schreiber, and G. Villard. Lattice-basedmemory allocation. IEEE Transactions on Computers,54(10):1242–1257, 2005.

[7] P. Feautrier. Dataflow analysis of array and scalarreferences. International Journal of ParallelProgramming, 20(1):23–53, 1991.

[8] P. Feautrier. Some efficient solutions to the affinescheduling problem, II, multidimensional time.International Journal of Parallel Programming,21(6):389–420, 1992.

[9] M. Griebl, P. Feautrier, and C. Lengauer. Index setsplitting. International Journal of ParallelProgramming, 28(6):607–631, 2000.

[10] S. Guelton, A. Guinet, and R. Keryell. Buildingretargetable and efficient compilers for multimediainstruction sets. In 2011 International Conference on


Parallel Architectures and Compilation Techniques,pages 169–170, 2011.

[11] G. Gupta and S. Rajopadhye. Simplifying reductions.In Proceedings of the 33rd ACM Conference onPrinciples of Programming Languages, PoPL ’06,pages 30–41, New York, NY, USA, Dec 2006. ACM.

[12] F. Irigoin and R. Triolet. Supernode partitioning. InProceedings of the 15th ACM SIGPLAN-SIGACTSymposium on Principles of Programming Languages,PoPL ’88, pages 319–329. ACM, 1988.

[13] V. Lefebvre and P. Feautrier. Automatic storagemanagement for parallel programs. ParallelComputing, 24(3-4):649–671, 1998.

[14] W. Pugh. A practical algorithm for exact arraydependence analysis. Communications of the ACM,35(8):102–114, 1992.

[15] F. Quillere and S. Rajopadhye. Optimizing memoryusage in the polyhedral model. ACM Transactions onProgramming Languages and Systems, 22(5):773–815,2000.

[16] F. Quillere, S. Rajopadhye, and D. Wilde. Generationof efficient nested loops from polyhedra. InternationalJournal of Parallel Programming, 28(5):469–498, 2000.

[17] J. Ramanujam and P. Sadayappan. Tiling of iteration

spaces for multicomputers. In Proceedings of the 1990International Conference on Parallel Processing,volume 2 of ICPP ’90, pages 179–186, 1990.

[18] R. Schreiber and J. J. Dongarra. Automatic blockingof nested loops. Technical report, 1990.

[19] M. Strout, L. Carter, J. Ferrante, and B. Simon.Schedule-independent storage mapping for loops.ACM SIGOPS Operating Systems Review,32(5):24–33, 1998.

[20] W. Thies, F. Vivien, J. Sheldon, and S. Amarasinghe.A unified framework for schedule and storageoptimization. In Proceedings of the 22nd InternationalConference on Programming Language Design andImplementation, PLDI ’01, pages 232–242. ACM,2001.

[21] D. Wilde and S. Rajopadhye. Memory reuse analysisin the polyhedral model. In Proceedings of the 2ndInternational Euro-Par Conference, pages 389–397,1996.

[22] M. Wolfe. Iteration space tiling for memoryhierarchies. In Proceedings of the Third SIAMConference on Parallel Processing for ScientificComputing, pages 357–361. Society for Industrial andApplied Mathematics, 1987.


On Demand Parametric Array Dataflow Analysis

Sven Verdoolaege
Consultant for LIACS, Leiden
INRIA/ENS, [email protected]

Hristo Nikolov
LIACS, Leiden
[email protected]

Todor Stefanov
LIACS, Leiden
[email protected]

ABSTRACT

We present a novel approach for exact array dataflow analysis in the presence of constructs that are not static affine. The approach is similar to that of fuzzy array dataflow analysis in that it also introduces parameters that represent information that is only available at run-time, but the parameters have a different meaning and are analyzed before they are introduced. The approach was motivated by our work on process networks, but should be generally useful since fewer parameters are introduced on larger inputs. We include some preliminary experimental results.

1. INTRODUCTION AND MOTIVATIONArray dataflow analysis [12] (also known as value-based

dependence analysis [20]) is of crucial importance in thepolyhedral framework with applications in array expansion[2, 10], scheduling [13], equivalence checking [5, 27] optimiz-ing computation/communication overlap in MPI programs[18] and the derivation of process networks [24, 28] to namebut a few. For program fragments that are static affine,dataflow analysis can be performed exactly [12, 15, 20]. Forprograms that contain certain dynamic and/or non-affineconstructs, both exact [4] and approximate [21] approacheshave been proposed. In particular, fuzzy array dataflowanalysis (FADA) [4] introduces additional parameters, whosevalues depend on run-time information, that allow the de-pendences to be represented exactly. After all parametershave been introduced, certain properties on the parametersare derived that allow for a simplification of the result. Inthe end, the parameters may also be projected out, resultingin approximate but static dependences.

The initial motivation for our new approach stems fromour work on the derivation of process networks. A processnetwork derived from a sequential program is essentiallya refinement of the dataflow graph of the program, wherenodes in the graph correspond to processes and edges tocommunication channels, and is mainly used to representthe task-level parallelism available in the program. Theseprocess networks may then be mapped to various hardwareimplementations [17]. The parameters introduced by FADAcan be used to construct control channels between the differ-ent processes [22]. Unfortunately, preliminary experimentswith the only known publicly available implementation ofFADA [6] have shown that this approach tends to introducetoo many parameters to be practically useful, whence theneed for an alternative approach. For our application, weare mainly interested in dynamic and/or non-affine condi-tions and this is currently the only extension beyond static

affine programs that we support.Our approach shares many similarities with FADA and

is therefore also useful for other applications of this ap-proach [3] (a description of these other applications is how-ever beyond the scope of the present paper). In particular,we also introduce parameters whose values depend on run-time information. However, our parameters have a differentmeaning, essentially representing the last iteration of a po-tential source that was executed, and we analyze their effectbefore they are introduced. This allows us to (typically) in-troduce fewer parameters, resulting in simpler dependencerelations that can be computed more efficiently. Unlike theFADA approach, our approach does not rely on a resolutionengine, but instead performs operations on affine sets andrelations to determine which parameters to add and whichconstraints they should satisfy.

We start with a description of our representation of staticaffine programs in Section 2 and an overview of standarddataflow analysis in Section 3. These sections also introducenotation that is used in later sections. The representationof dynamic conditions in the input is explained in Section 4,while the representation of dynamic dependence relations isexplained in Section 5. Section 6 describes the computationof dynamic dependence relations. In Section 7, we discusssome extensions. We conclude with preliminary experimen-tal results in Section 8 and a comparison to related work inSection 9.

2. PROGRAM REPRESENTATION

Each input program is represented using a polyhedral

model [14], consisting of iteration domains, access relations,dependence relations and a schedule. Each statement hasan associated iteration domain containing the values of theiterators of the enclosing loops for which the statement is ex-ecuted. For example, the iteration domain of the statementin Line 14 of Listing 1 is

{ H(k, i, j) | 0 ≤ k, j ≤ 99 ∧ 0 ≤ i < 99 }. (1)

Note that n is modified inside the program and can there-fore not be treated as a parameter. This iteration domainis therefore an overapproximation of the set of executed it-erations, based on the bounds on n specified by the user inLine 3. The mechanism for filtering out the iterations thathave not actually been executed is explained in Section 4.An access relation maps elements of an iteration domain tothe element(s) of an array domain accessed by that itera-tion of the associated statement through some array refer-ence. For example, the access to a in the same statement is


1 int n, m;

2 int a[100][100];

3 #pragma value_bounds n 0 99

4 #pragma value_bounds m 0 100

5

6 N1: n = f();

7 for (int k = 0; k < 100; ++k) {

8 M: m = g();

9 for (int i = 0; i < m; ++i)

10 for (int j = 0; j <= n; ++j)

11 A: a[j][i] = g();

12 for (int i = 0; i < n; ++i)

13 for (int j = 0; j < m; ++j)

14 H: h(i, j, a[i + 1][j]);

15 N2: n = f();

16 }

Listing 1: Code with locally static conditions

represented as { H(k, i, j) → a(i + 1, j) }, simplified with re-spect to the iteration domain. In this paper, a dependencerelation maps a (nested) read access relation to the writeaccess relation that wrote the value being read by the readaccess. For example, the dependence relation for the aboveread access is { (H(k, i, j) → a(i + 1, j)) → (A(k, j, i + 1) →a(i + 1, j)) | 0 ≤ k, j ≤ 99 ∧ 0 ≤ i < 99 }, while the depen-dence relation for the read of n in the bound of the enclosingloop (simplified with respect to the iteration domain) is

{ (H(0, i, j) → n()) → (N1() → n()) } ∪
{ (H(k, i, j) → n()) → (N2(k − 1) → n()) | k ≥ 1 }.   (2)

Note that the above access and dependence relations are allsingle-valued functions. We will, however, also use the “→”notations for other relations that may not represent single-valued functions. The schedule maps the union of all itera-tion domains to a common space where the execution orderof the corresponding statement iterations is determined bythe lexicographical order.

Each of the above sets and relations is represented inisl [25]. The iteration domains, the access relations andthe initial schedule are extracted using pet [26], while theconstruction of the dependence relations is the topic of thepresent paper. In the base case, the input program con-sists of expression statements, if conditions and for loopswith only static quasi-affine index expressions, conditionsand bounds. The representation of dynamic (or non-affine)conditions, the main focus of the present paper, is explainedin Section 4. Dynamic loop conditions and dynamic indexexpressions are briefly discussed in Section 7. The initialschedule describes an ordering that corresponds to the orig-inal execution order. To simplify the exposition, we will nottake arbitrary initial schedules as input, but instead exploitthe positions of the statements inside the abstract syntaxtree, in particular the number of shared outer loops and therelative order of pairs of statements.

3. STANDARD DATAFLOW ANALYSIS

Several algorithms have been proposed in the literature for

performing dataflow analysis in the case of static affine pro-grams [12,15,20]. Our implementation in isl is a variationof these algorithms. We do not claim any novelty in this

implementation and we will not describe the implementa-tion in detail here. Instead, we only highlight those aspectsthat are important for understanding the remainder of thispaper.

Dataflow analysis is performed separately for each read access (a.k.a. "sink" or "consumer") C. For each such read access we consider all the write accesses (a.k.a. "potential source" or "producer") P that access the same array, ordering them such that the "closest" are considered first. Let AC and AP be the corresponding access relations. Furthermore, let B represent the relative ordering of the statements. That is, if I is the union of all iteration domains, then B is a binary relation on I with i → j ∈ B iff j is executed before i. For example, { H(k, i, j) → N2(k′) | 0 ≤ k, k′, i, j ≤ 99 ∧ k′ < k } is a subset of B. Note that bold variable names are used to denote (named) integer tuples. Furthermore, if S is a set and if R, R1 and R2 are relations, then let 1S be the identity relation on S, i.e.,

1S = { s → s | s ∈ S },

let R−1 be the inverse of R, i.e.,

R−1 = { t → s | s → t ∈ R },

let R2 ◦ R1 be the composition of R1 and R2, i.e.,

R2 ◦ R1 = { s → u | ∃t : s → t ∈ R1 ∧ t → u ∈ R2 },

let R1 × R2 denote the cross product of R1 and R2, with

R1 × R2 = { (s → t) → (u → v) | s → u ∈ R1 ∧ t → v ∈ R2 },

and let ran−→R map a nested copy of R to its range, i.e.,

ran−→R = { (s → t) → t | s → t ∈ R }.
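For readers who prefer code to set notation, these operators are easy to state on finite relations represented as sets of (source, target) pairs; this extensional representation is only illustrative, since the actual relations are parametric affine sets and maps handled by isl.

```python
def inverse(R):
    """R^{-1} = { t -> s | s -> t in R }."""
    return {(t, s) for s, t in R}

def compose(R2, R1):
    """R2 o R1 = { s -> u | exists t : s -> t in R1 and t -> u in R2 }."""
    return {(s, u) for s, t1 in R1 for t2, u in R2 if t1 == t2}

def cross(R1, R2):
    """R1 x R2 = { (s -> t) -> (u -> v) | s -> u in R1 and t -> v in R2 }."""
    return {((s, t), (u, v)) for s, u in R1 for t, v in R2}

def ran_arrow(R):
    """Map a nested copy of R to its range."""
    return {((s, t), t) for s, t in R}
```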

If the value read by an element in the domain of AC was written inside the program fragment under consideration, then it was written by one of the domain elements of one of those write access relations AP. In particular, it is an element of the image of the relation ((AP)−1 ◦ AC) ∩ B for some P. Since we want to keep track of the array elements being accessed, we can extend the above computation to

DmemC,P = ((ran−→AP)−1 ◦ (ran−→AC)) ∩ (B × 1A),   (3)

where DmemC,P refers to the "memory based" dependences of C on P and A is the union of all array domains. For example, if AP = { N2(k) → n() | 0 ≤ k ≤ 99 } then ran−→AP = { (N2(k) → n()) → n() | 0 ≤ k ≤ 99 }. Combining this with AC = { H(k, i, j) → n() | 0 ≤ k, i, j ≤ 99 }, we obtain DmemC,P = { (H(k, i, j) → n()) → (N2(k′) → n()) | 0 ≤ k, k′, i, j ≤ 99 ∧ k′ < k }.

If the iteration domain of the statement containing C has dimension d, then we first consider elements of the DmemC relations such that the first d iterators in domain and range have the same value, then d − 1 iterators, continuing until we consider the case where 0 iterators have the same value. Let ℓ denote this number of equal iterators. For each value of ℓ, we consider the potential sources P that share at least ℓ outer loops, and compute the maximum image element of DmemC,P ∩ (Eℓ= × 1A), with Eℓ= expressing that exactly ℓ outer iterators have the same value, adding constraints that ensure this maximal element is executed after any previously computed maximum at this level. When moving from ℓ to ℓ − 1, we only consider those elements from the sink access relation for which no source has been found so far. The main operation in this computation is then that of the partial lexicographical maximum of a relation M on a domain U, which returns a relation mapping elements u of U to the lexicographically greatest element associated to u by M, along with the set of elements in U that do not have any images in M. Let us define some more operations on sets and relations. Specifically, if S1 and S2 are sets and if R is a relation, then the universal relation from S1 to S2 is denoted

S1 → S2 = { s → t | s ∈ S1 ∧ t ∈ S2 },

while the domain and range of R are denoted

dom R = { s | s → t ∈ R }

and

ran R = { t | s → t ∈ R }.

The lexicographical maximum of a relation R is denoted lexmax R and is equal to

{ s → t | s → t ∈ R ∧ (∀t′ : s → t′ ∈ R ⇒ t′ ≼ t) },

with "≼" representing the lexicographical order. Finally, the partial lexicographical maximum of M on U is

lexmaxU M = (lexmax(M ∩ (U → ran M)), U \ dom M).   (4)

The partial lexicographical maximum can be computed using parametric integer programming [11]. In the example, the read of n in Line 12 can only have 0 equal loop iterators with the write in Line 15. We therefore compute lexmaxU DmemC,P with U the (nested) sink access relation. The result consists of the (simplified) relation { (H(k, i, j) → n()) → (N2(k − 1) → n()) | k > 0 } and the set { (H(0, i, j) → n()) }. Sources for this latter part of the sink relation can be found in Line 6. The final result is shown in (2).
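To make the operations above concrete, the sketch below computes lexmax and the partial lexicographical maximum for relations represented extensionally as finite sets of (source, target) tuple pairs; this finite representation is only for illustration, whereas the actual computation operates on parametric affine relations via parametric integer programming.

```python
def lexmax(R):
    """Keep, for each source s, only the lexicographically
    greatest target t with (s, t) in R."""
    best = {}
    for s, t in R:
        if s not in best or t > best[s]:  # tuple comparison is lexicographic
            best[s] = t
    return {(s, t) for s, t in best.items()}

def partial_lexmax(M, U):
    """Partial lexicographical maximum of M on the domain U:
    (lexmax of M restricted to U, elements of U without an image)."""
    restricted = {(s, t) for s, t in M if s in U}
    no_source = U - {s for s, _ in M}
    return lexmax(restricted), no_source

# Hypothetical candidate-source relation: sink (1,) may read from
# writes (0,) or (2,); sink (3,) has no candidate source.
M = {((1,), (0,)), ((1,), (2,))}
U = {(1,), (3,)}
print(partial_lexmax(M, U))  # ({((1,), (2,))}, {(3,)})
```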

4. FILTERS

As explained in Section 2, if there are any dynamic conditions in the program, then the iteration domain may be an affine overapproximation of the set of executed iterations. To mark those iterations that are actually executed, we apply one or more filters to the iteration domain. These filters encode the dynamic conditions that determine whether an iteration in the iteration domain is actually executed. For example, as shown before, the iteration domain (1) is an overapproximation of the executed iterations of the statement in Line 14 of Listing 1. The filter on this iteration domain then expresses the dynamic conditions i < n and j < m. Each filter consists of a sequence of filter access relations, accessing variables that may be updated during the execution of the program, and a filter value relation, mapping statement iterations to the possible values of the dynamic variables for which the iteration is executed. For simplicity of exposition, we will assume that all the filter access relations are functions and that there is only one filter. The more general case is discussed in Appendix A. Non-affine conditions are treated as dynamic conditions.

More formally, let S be a statement and IS its iteration domain. Furthermore, let the filter on S consist of a filter value relation VS and nS filter access relations FSi. We have

FSi ⊆ IS → (IS → A) and VS ⊆ IS → Z^nS.

That is, each filter access relation maps an iteration to an access of an array element, and the filter value relation maps an iteration to a tuple of values. Moreover, each FSi is a function on the iteration domain. The application of the relation R to the set S is defined as

R(S) = { u | ∃t : t → u ∈ R ∧ t ∈ S }.

Furthermore, let (aj)j=1,…,n be the n-tuple with elements aj, and let V({ j → b }) be the value of b at j. Iteration k ∈ IS is then executed iff executedS(k) holds, with

executedS(k) = (V(FS1(k)), …, V(FSnS(k))) ∈ VS(k),   (5)

where FSj(k) is short for FSj({ k }). An access k → a from an iteration k is executed iff the iteration is executed, so that executedS((k → a)) = executedS(k). Let dom−−→R map a nested copy of R to its domain, i.e.,

dom−−→R = { (s → t) → s | s → t ∈ R };

then, initially, each filter access relation is a subset of the relation (dom−−→(I → A))−1. That is, each iteration is mapped to an access that takes place at the same iteration.

We consider two types of filters, one for the general case

and one for the special case of “locally static affine” condi-tions. A condition is considered to be locally static affine ifit is an affine expression in variables that are definitely notmodified between the point where they are evaluated andall the statements that are guarded by the condition. Suchvariables are called “locally static”. In this case, the accessesto the locally static variables themselves are used as filteraccesses. Otherwise, a new statement is created that eval-uates the condition and writes the resulting boolean value(0 or 1) to a virtual array, as in [26, Section 4.3]. This vir-tual array is then used as a filter access in a condition thatsimply evaluates the element of the virtual array. The firsttype of filter is allowed in both if-conditions and loop con-ditions, while the second type is in the current context onlyallowed in if-conditions. In Section 7.1, we explain how toalso handle the second type in loop conditions.

Consider, for example, the code in Listing 1. The condi-tion i < n in Line 12 references the variable n which is nota parameter because it is assigned in Line 6 and in Line 15.However, the value of n is clearly not changed between thecondition in Line 12 and the statement in Line 14 governedby this condition. Similarly, m is also locally static for thestatement. The filter for statement H therefore has filteraccess relations

F H1 = { H(k, i, j)→ (H(k, i, j)→ n()) } (6)

and

F H2 = { H(k, i, j)→ (H(k, i, j)→ m()) }. (7)

The filter value relation V H is

{ H(k, i, j) → (n, m) | 0 ≤ k ≤ 99 ∧ 0 ≤ i < n ∧ 0 ≤ j < m }.   (8)

For convenience to the reader, the dimensions that corre-spond to the filters have been named after the arrays (inthis case scalars) that are being accessed by the correspond-ing filter access relations. However, the reader should keep


1 state = 0;

2 while (1) {

3 sample = radioFrontend ();

4 if (t(state)) {

5 D: state = detect(sample );

6 } else {

7 C: decode(sample , &state , &value0 );

8 value1 = processSample0(value0 );

9 processSample1(value1 );

10 }

11 }

Listing 2: Code with dynamic conditions, adapted from [7, Figure 8.1]

The code in Listing 2 has a generic dynamic condition in Line 4. pet introduces a separate virtual array, say t0, for storing the result of the condition and a separate statement, say S0, for computing this result. The access relation that writes to the filter is of the form

{ S0(i) → t0(i) }.

Note that pet introduces an implicit iterator (called i in this relation) with non-negative values [26, Section 3.3] for the loop while (1) in Line 2. The filter value relation of the statement in Line 5 is

V^D = { D(i) → (1) | i ≥ 0 }

with filter access relation

F^D = { D(i) → (D(i) → t0(i)) }.

That is, the statement is executed for values of i greater than or equal to 0 such that t0(i) at D(i) is equal to 1. (Recall that t0(i) is a boolean variable that only attains values 0 and 1.) The filter value relation of the statement in Line 7 is

V^C = { C(i) → (0) | i ≥ 0 }    (9)

with filter access relation

F^C = { C(i) → (C(i) → t0(i)) }.    (10)
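Written out by hand, the effect of introducing the virtual array t0 and the statement S0 can be pictured as the following rewrite of Listing 2 (this is only an illustration of the filter semantics described above, not code generated by pet; the explicit iterator i stands for pet's implicit iterator):

    state = 0;
    for (int i = 0; ; ++i) {             /* implicit iterator of while (1) */
        sample = radioFrontend();
    S0: t0[i] = (t(state) != 0);         /* virtual array records the condition */
        if (t0[i]) {                     /* filter of D: t0(i) equals 1 */
    D:      state = detect(sample);
        } else {                         /* filter of C: t0(i) equals 0 */
    C:      decode(sample, &state, &value0);
            value1 = processSample0(value0);
            processSample1(value1);
        }
    }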

Most of the analysis in Section 6 is based on filter values being equal or different in different iterations. We therefore need to be able to identify that filter values accessed in different iterations are actually the same. This information can be obtained by applying dataflow analysis on the arrays accessed by the filters. As explained below, the result of this analysis may or may not depend on any parameters as defined in Section 5. If any such parameters are involved, then we keep the original filter access relations. If, on the other hand, the resulting dependence relations do not depend on any such parameters, then the original filter access relations can be replaced by a composition with these dependence relations. For the code in Listing 1, the dependence relation for n in Line 12 is shown in (2). Applying this dependence relation to the filter access relation in (6) yields

F^H_1 = { H(0, i, j) → (N1() → n()) } ∪ { H(k, i, j) → (N2(k − 1) → n()) | k ≥ 1 }.    (11)

For the filter access relation in (10), no dataflow analysis needs to be performed since we know each element of the virtual array t0 is written by exactly one iteration of the statement S0. The filter access relation can therefore be replaced by

F^C = { C(i) → (S0(i) → t0(i)) }.    (12)

In principle, the meaning of V depends on whether sources have been found for the filter array or not, i.e., whether we have been able to compose the original filter access relations with dependence relations or not. In particular, if sources have been found, then V({ j → b }) is the value of b after the write in iteration j, while if sources have not been found, then V({ j → b }) is the value of b before the read in iteration j. In practice, however, this difference is not important since filter accesses for which no sources have been found will never be matched to any other filter accesses.

5. PARAMETER REPRESENTATION

If the iteration domain of a potential source P is affected by any filters, then we cannot simply compute the lexicographical maximum in (4) since the maximal element in the range of M may not actually be executed, even if some of the other elements are. We will therefore introduce parameters that represent the "last executed" source iteration. By equating the iterators in the range of M to these parameters, the lexicographical maximization operation will simply return these parameters, which are guaranteed to represent an executed iteration.

The exact form and meaning of the parameters depend on whether we are considering the final result of the dataflow analysis or intermediate results. Let us first consider the final result. Let D_{C,P} be the final dependence relation, the description of which may involve some parameters β^Q_C(k) and λ^Q_C(k), where Q may be either P or some other potential source and k is an element of the sink access relation A_C. Note that in principle the parameters depend on the sink access k, but as we will explain below, we do not need to make this dependence explicit in our representation. Let D′_{C,P} be the result of projecting out all these parameters. The parameters then have the following meaning. The parameter β^Q_C(k) is a boolean variable that expresses whether any of the elements in D′_{C,P}(k) is executed. If β^Q_C(k) = 1, then λ^Q_C(k) represents the last element in D′_{C,P}(k) that is executed. (If β^Q_C(k) = 0, then λ^Q_C(k) is undefined.) In other words, we have

β^P_C(k) = 0 ⇒ ∀j ∈ D′_{C,P}(k) : ¬executed_{S_P}(j)
β^P_C(k) = 1 ⇒ executed_{S_P}(λ^P_C(k)) ∧ ∀j ∈ D′_{C,P}(k) : j ≻ λ^P_C(k) ⇒ ¬executed_{S_P}(j),    (13)

with S_P the statement containing access P and executed as defined in (5). For example, let C be the access to state in the statement S0 evaluating the condition in Line 4 of Listing 2. Let P be the access to the same variable (state) from statement C. The dependence relation for the dependence of C on P is of the form

(β^P_C, λ^P_C) → { (S0(i) → state()) → (C(λ^P_C(i)) → state()) | β^P_C(i) = 1 ∧ i = λ^P_C(i) + 1 ≥ 1 }.

That is, there is only a dependence if access P (from statement C) was ever executed (before S0(i)) and if the last execution was in the previous iteration.


Note that if the last execution was in some earlier iteration, then C would not depend on the access in statement C, but on that in statement D. This result is obtained in Section 6.4.

During the computation, the resulting dependence relation is obviously not known, but we do know that it is a subset of D^mem_{C,P} (3). The relation M in the current maximization problem (4) is also a subset of D^mem_{C,P}. We can therefore temporarily treat λ^P_C(k) as the last element of D^mem_{C,P}(k) that is executed. Note that because we still assume static affine index expressions, the array element accessed by this last executed element of D^mem_{C,P}(k) is necessarily the same as that accessed by k. We can also exploit additional information to reduce the number of elements in the λ^P_C(k) vector that need to be explicitly represented. In particular, we know that the first ℓ iterators in the domain and range of M (with ℓ as in Section 3) are pairwise equal and that the same property holds for the previously considered maximization problems within the current dataflow problem, i.e., for the same C. This allows us to avoid introducing elements of the λ^P_C(k) vector until we really need them. In particular, we keep track in σ^P_C(k) of the number of initial elements in the domain of k and in λ^P_C(k) that are (implicitly) equal to each other. Values of σ^P_C(k) smaller than ℓ then mean that there is no element in D^mem_{C,P}(k) that is executed and that shares the values of the first ℓ iterators with k. As a special case, σ^P_C(k) < 0 means that no element in D^mem_{C,P}(k) is executed. Summarizing, (13) is replaced by

σ^P_C(k) < ℓ ⇒ ∀j ∈ D″_{C,P}(k) : ¬executed_{S_P}(j)
σ^P_C(k) ≥ ℓ ⇒ executed_{S_P}(λ^P_C(k)) ∧ ∀j ∈ D″_{C,P}(k) : j ≻ λ^P_C(k) ⇒ ¬executed_{S_P}(j),    (14)

where D″_{C,P} = D^mem_{C,P} ∩ E^ℓ_≥, with E^ℓ_≥ expressing that at least ℓ outer iterators have the same value.

After the dataflow computation has finished, we need to convert the intermediate representation (14) to the final representation (13). If ℓ^P_≤ is the smallest value of ℓ for which we had to apply parametrization for a given potential source P, then we do not need to introduce dimensions of λ^P_C before ℓ^P_≤. Instead, we need to make the equalities implied by σ^P explicit and set β^P_C(k) = 1 when σ^P ≥ ℓ^P_≤ and β^P_C(k) = 0 when σ^P < ℓ^P_≤. The parameter σ^P can then be projected out. Besides dimensions before ℓ^P_≤, we also do not need to introduce dimensions of λ^P_C for which D^mem_{C,P}(k) attains a single value (for any given value of k) or that correspond to loops inside the innermost condition that is not static affine.

Since λ^P_C and σ^P_C depend on the sink iteration, it may appear that we would need to treat them as uninterpreted functions. During the entire computation, there is however no interaction between different sink iterations. That is, any of the intermediate relations during the computation only refers to a single sink iteration and the parameters λ^P_C and σ^P_C (if present) always refer to that single sink iteration. This means that we can keep the relation between λ^P_C and σ^P_C on one hand and k on the other hand entirely implicit and simply treat λ^P_C and σ^P_C as parameters. The intended meaning of those parameters is then the value of the corresponding functions at the particular value of k involved in the given access or dependence relation. This reasoning is essentially the same as that of [1] for showing that his α vectors can be treated as parameters.

Algorithm 1: Parametric partial lexicographical maximum

(type, S1, S2) = parametrization(M, U)
if type = Input then
    M := intersect range(M, S1)
    U := intersect(U, S2)
else if type = Empty then
    M := empty(M)
end
(R, E) = partial lexmax(M, U)
if type = Output then
    R := intersect range(R, S1)
end

6. PARAMETRIZATION

Recall that during dataflow analysis (Section 3), we frequently compute a partial lexicographical maximum of the form (4). The inputs to this operation are based on the static affine iteration domains and so we may need to introduce parameters representing the last executed source iteration (Section 5) to take into account the filters (Section 4) on the iteration domains. In particular, we replace a call "(R, E) = partial lexmax(M, U)", corresponding to (4), by the pseudocode in Algorithm 1. The different types of parametrization in this code are explained in Section 6.1, while the determination of which of these parametrizations should be applied is explained in Section 6.3. The sets S1 and S2 used during the parametrization are constructed in Section 6.2 and Section 6.4.

6.1 Types of Parametrization

We are presented with a maximization problem of the form (4), with U a set of iterations of a sink C and M a relation between sink iterations and iterations of a potential source P, and we want to decide if we need to introduce parameters. In principle, the result is that either parametrization is required (type = Input) or it is not required (type = None). However, we also consider a couple of other special cases (type = Empty and type = Output). In particular, we may find that it is impossible for any of the source iterations to execute given that the corresponding sink is executed. In such cases we want to avoid introducing parameters since there is no dependence and we therefore do not need any extra parameters to represent the dependence. However, we cannot indicate to the dataflow analysis that no parametrization is required (type = None) as then it would treat the source iterations as definitely being executed. Instead, we communicate to the dataflow analysis that M should be replaced by an empty relation (type = Empty). Another special case occurs when we are able to determine that no parametrization is required, but that this detection depends on information available about other potential sources. For reasons beyond the scope of the present paper, the construction of process networks can in this case be facilitated by introducing parameters anyway, but only on the result of the maximization problem rather than on the input of the maximization problem (type = Output). A schematic overview of the determination of the type of parametrization to apply is shown in Algorithm 2. The process is explained in more detail in Section 6.3.


6.2 Application

If parametrization is required, then we need to equate the source iteration to the parameters λ^P representing the last executed iteration of the potential source P (14). That is, the range of M (or, more precisely, the domain of the nested relation in this range) is intersected with

(λ^P_I, σ^P) → { S_P(j) | j_I = λ^P_I ∧ σ^P ≥ ℓ },    (15)

where I selects the elements of λ^P that need to be explicitly represented as explained in Section 5 and ℓ is the number of equal outer iterators in the domain and range of M as in Section 3. These constraints determine the set S1 in Algorithm 1. For example, let C be the read of A in Line 14 of Listing 1 and let P be the write in Line 11. The corresponding statements share at most one loop iterator. When ℓ = 1, then we have that M is equal to

{ (H(k, i, j) → a(i + 1, j)) → (A(k, j, i + 1) → a(i + 1, j)) },    (16)

where the outer dimensions k are equal because ℓ = 1 and the inner dimensions i′ and j′ are equal to j and i + 1 because the same array element is accessed. As we will see in Section 6.3, no parametrization is required for this problem, but let us for illustrative purposes assume that we do want to apply parametrization. Dimension 0 of λ does not need to be introduced (yet) because 0 < ℓ. Since D^mem_{C,P} is equal to

{ (H(k, i, j) → a(i + 1, j)) → (A(k′, j, i + 1) → a(i + 1, j)) | k′ ≤ k },

the remaining dimensions of λ do not need to be introduced either because they are fully determined by D^mem_{C,P} (and therefore also by D′_{C,P} ⊆ D^mem_{C,P}). In effect, we would only need to introduce σ^P with constraint σ^P ≥ 1.

The parametrization of the source domain expresses that the source is (essentially) the last of the potential source iterations associated to the sink through D^mem_{C,P}. However, it only does so through the introduction of parameters that implicitly depend on the sink iteration k. The parametrization of the sink takes care of expressing that this last iteration, if it exists, actually belongs to D^mem_{C,P}(k). In particular, we first split the set of sink iterations U into two parts, one with associated potential source iterations and one without, i.e.,

U1 = U ∩ (dom M) and U2 = U \ (dom M).    (17)

The parametrization is only applied to U1 and expresses that either none of the potential source iterations in D^mem_{C,P}(k) that share the first ℓ iterators is executed or that there is such an executed potential source iteration and that it belongs to D^mem_{C,P}(k). That is, the sink C is intersected with

(λ^P, σ^P) → { k | σ^P < ℓ ∨ (λ^P ∈ D^mem_{C,P}(k) ∧ σ^P ≥ ℓ) }.    (18)

These constraints, together with those of Section 6.4 below, determine the set S2 in Algorithm 1. In practice, we only introduce the same set of dimensions of the λ^P vector as introduced by (15). In particular, the second disjunct is obtained by applying the parametrization of (15) to the range of D^mem_{C,P} and computing the domain of the result.

6.3 Detection

Let us now explain in more detail the steps in the determination of which parametrization to apply as sketched in Algorithm 2.

Algorithm 2: Type of parametrization

1  if M is empty, U is empty or there are no filters on the source then
2      return None
3  end
4  F := filters on the sink
5  if filters on the source contradict F then
6      return Empty
7  end
8  F′ := update(F, filters on other sources)
9  if filters on the source contradict F′ then
10     return Empty
11 end
12 if filters on the source imply F then
13     return None
14 end
15 if filters on the source imply F′ then
16     return Output
17 end
18 return Input

If M and/or U are empty or if P is not affected by any filter, then no parametrization is required (Line 2). Otherwise, we compute the possible values for the filter elements at the potential source, given that the sink is executed. Clearly, we can only do this if the sink is affected by a filter and if source and sink have some filter accesses in common. If the computed possible values are disjoint from the filter value relation on the source, then no source iteration is executed when the sink is executed and we indicate that M should be replaced by an empty relation (Line 6). Otherwise, we check if U references any parameters that were introduced by previous calls to the parametrization. If so, we use the constraints on those parameters and the filters of the associated (other) potential sources to derive extra information about the filters at the sink (Line 8). This derivation is explained in Section 6.3.2. The resulting information is then propagated again to potential source P and a new relation of possible values is computed. If this relation is disjoint from the filter value relation on the source, then we again indicate that M should be replaced by an empty relation (Line 10). Otherwise, if the computed relation is a subset of the filter value relation on the source, then we know the source is always executed and we indicate that either no parametrization is required (Line 13) or that parametrization is only required on the output of the maximization (Line 16), depending on whether the relation computed based on only information on the sink was already a subset of the filter value relation. Finally, if none of these cases apply, then parametrization is required (on the input of the maximization problem, Line 18).

6.3.1 Filter Values implied by the Sink

Let us illustrate the derivation of filter value constraints on the potential source from those on the sink based on an example. The general case is explained in Section A.2. In particular, let us reconsider the example from Section 6.2. Let M1 be the result of projecting out the array elements from M, i.e.,

M1 = { H(k, i, j) → A(k, j, i + 1) | 0 ≤ i ≤ 98 },


where we omit the constraints on j and k for brevity. The filters on statement A are similar to those on H in (6) and (7). The first of these was updated in (11) to take into account the filter source. The second can be updated to

F^H_2 = { H(k, i, j) → (M(k) → m()) | 0 ≤ i ≤ 98 }.

For statement A, we have, say,

F^A_1 = { A(k, i, j) → (M(k) → m()) | 0 ≤ j ≤ 99 }.

We want to check whether the fact that the sink is executed tells us anything about the values of these filter variables at the source. Note that if the sink is not executed, then it does not need any values and so there is no need to compute dataflow dependences for sink iterations that are not executed.

The first step is to check whether any of the source filter arrays are also accessed by the corresponding sink iterations. To do so, we pull back the filter access relations over M1, resulting in

F^A_1 ∘ M1 = { H(k, i, j) → (M(k) → m()) | 0 ≤ i ≤ 98 },

where the constraints 0 ≤ i ≤ 98 derive from 0 ≤ i + 1 ≤ 99 and the domain constraints of M1. Since this relation is a subset of F^H_2, i.e., ∀j ∈ M1(k) : F^A_1(j) ⊆ F^H_2(k), and similarly for F^A_2 ∘ M1 ⊆ F^H_1, we know that (5) also holds for the source filter accesses for all j ∈ M1(k), i.e.,

(V(F^A_i(j)))^2_{i=1} ∈ V1(k)

with V1 derived from V^H in (8) by changing the order of the range dimensions to match the matching of the filters, i.e.,

{ H(k, i, j) → (m, n) | 0 ≤ k ≤ 99 ∧ 0 ≤ i < n ∧ 0 ≤ j < m }.

Note that in this particular example, we have F^A_1 ∘ M1 = F^H_2 and F^A_2 ∘ M1 = F^H_1, but equality is not required in the general case. There is also no need for every filter access relation on the source to match a filter access relation on the sink and vice versa. Besides reordering dimensions of V^H, we may in the general case also have to project out dimensions and/or introduce unconstrained dimensions.

Precomposing V1 with M1^{-1}, we obtain a mapping from source iterations to filter values that allow one or more corresponding sink iterations to be executed. In the example, we obtain

{ A(k, i, j) → (m, n) | 0 ≤ k ≤ 99 ∧ 0 ≤ j − 1 < n ∧ 0 ≤ i < m }.

Since this relation is a subset of the filter value relation on A, i.e.,

{ A(k, i, j) → (m, n) | 0 ≤ k ≤ 99 ∧ 0 ≤ i < m ∧ 0 ≤ j ≤ n },

we know that for each sink iteration in the domain of M′ that is executed, the corresponding source iterations are also executed and therefore no parametrization is required.
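The containment used in this last step can also be checked mechanically: every (i, j, m, n) allowed by the relation derived from the sink also satisfies the filter value relation on A (the k dimension is identical on both sides and is omitted). The following brute-force check over an arbitrary small box is only a sanity check of this set inclusion; the actual computation is performed symbolically on the relations:

    #include <assert.h>

    int main(void)
    {
        /* values implied by the sink: 0 <= j - 1 < n and 0 <= i < m;
           filter value relation on A:  0 <= i < m and 0 <= j <= n */
        for (int i = 0; i < 20; ++i)
            for (int j = 0; j < 20; ++j)
                for (int m = 0; m < 20; ++m)
                    for (int n = 0; n < 20; ++n)
                        if (0 <= j - 1 && j - 1 < n && 0 <= i && i < m)
                            assert(0 <= i && i < m && 0 <= j && j <= n);
        return 0;
    }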

6.3.2 Filter Values implied by Other Sources

The derivation of extra information from other potential sources is again illustrated based on an example. The general case is explained in Section A.3. In particular, let us consider the determination of the sources for the access to state in Line 4 of Listing 2. We first consider the write P in Line 7 as a potential source for ℓ = 0. Parametrization is required and in particular (18) specializes to

(λ^P_0, σ^P) → { S0(i) | σ^P < 0 ∨ (0 ≤ λ^P_0 < i ∧ σ^P ≥ 0) }.

The dataflow analysis then continues looking for sources from the write Q in Line 5. Let us now consider the case where σ^P < 0. Since statement S0 is not affected by any filters, we need to turn to the other potential sources, in particular P, for any constraints that could help us determine whether parametrization is required.

We first construct a relation N mapping sink iterations to iterations of P that have definitely not been executed (according to (14) and given σ^P < 0), i.e.,

N = { S0(i) → C(i′) | 0 ≤ i′ < i }.

This relation can be constructed by subtracting iterations that may have executed from D^mem_{C,P} (with the array elements projected out). In the example, there are no iterations that may have executed (since σ^P < 0) and so in this case N = D^mem_{C,P}. From (5), we know that the values of the corresponding filter elements (12) do not satisfy the filter value relation (9), i.e., for all S0(i) → C(i′) ∈ N we have

V(F^C(C(i′))) ∉ V^C(C(i′)).

In other words,

V(F^C(C(i′))) ∈ V1(C(i′)),

with

V1 = { C(i) → (1) | i ≥ 0 }.

Note that t0 is a boolean variable, so if its value is not 0, it must be 1. Combining N and F^C (12) into a single relation, we obtain

N1 = { (S0(i) → (S0(i′) → t0(i′))) → C(i′) | 0 ≤ i′ < i }.

Let R1 ⋈ R2 be the domain product of two relations, mapping nested pairs of domain elements from R1 and R2 to their shared images, i.e.,

R1 ⋈ R2 = { (s → t) → u | s → u ∈ R1 ∧ t → u ∈ R2 }.

In general, we then have N1 = N ⋈ (F^C)^{-1}. The relation N1 maps a pair of sink iteration and filter access to source iterations that have definitely not been executed and that perform the filter access, with value in V1. The same filter element may be accessed by multiple source iterations, each of them imposing the V1 constraints. We therefore compute the intersection of the V1 images over all associated source iterations. In the example, only a single source iteration is associated to a given filter element and we obtain

V2 = { (S0(i) → (S0(i′) → t0(i′))) → (1) | 0 ≤ i′ < i }.

Projecting out the filter access, we obtain the filter value relation

V3 = { S0(i) → (1) | 1 ≤ i },

which is valid for any element of F^C(N(S0(i))), with i ≥ 1. This information can then be propagated to the potential source Q using the technique of Section 6.3.1, from which it can be concluded that Q is always executed (in the current case where σ^P < 0) and that therefore no parametrization is required. Note that the combined filter access relation F^C ∘ N is no longer a function, but, as explained in Section A.2, the computations are also valid for multi-valued access relations. We just need to be careful about the domains of the access relations.


6.4 Additional Constraints

If we determine that we need to apply parametrization, then we may in some cases wish to impose additional constraints on the introduced parameters beyond those of Section 6.2. In particular, we may find that not all potential source iterations are executed (otherwise no parametrization would be required), but that some potential source iterations are definitely executed. If so, we add constraints that impose that there is a last execution (σ^P ≥ ℓ) and that this last execution is no earlier than any of the definitely executed iterations.

Additionally, we check for conflicts between the filters associated to the source of the newly added parameters and those of previously added parameters, introducing extra constraints that avoid the conflicts. As usual, we only show an example, while the details are explained in Section A.4. Continuing from the example in Section 6.3.2, let us consider the case where we are looking for an instance of the write Q in Line 5 of Listing 2 that is executed after the last execution of the write P in Line 7. Parametrization is required in this case, but the sink already contains parameters that refer to P, so we look for possible conflicts between the parameters of P and Q.

Based on dataflow analysis on the filter arrays, e.g., (12), we know that identical iterations of D and C access the same filter element, written in the same iteration. These filter elements therefore need to have the same value. Let us try to derive conflicts from iterations of both statements that are not executed. To ease the notation, we will only consider the case σ^P, σ^Q ≥ 0 and omit these variables and constraints. The iterations in D^mem_{C,P} that are not executed are given by

λ^P_0 → { S0(i) → C(i′) | 0 ≤ λ^P_0 < i′ < i }

with filter value { C(i′) → (1) } and similarly for Q. Pairing up the non-executed iterations of C and D and restricting them to the pairs that should have the same value, we obtain

(λ^P_0, λ^Q_0) → { S0(i) → (C(i′) → D(i′)) | 0 ≤ λ^P_0, λ^Q_0 < i′ < i }.

Considering now the pairs of iterations from the two statements that map to values that allow the iterations to not be executed and such that these values are the same, we find that there are no such pairs. Subtracting this set of pairs of iterations that have the same value from the range of the relation above, we obtain a mapping from sink iterations to pairs of source iterations that should have the same values for their filters but in fact do not. In this example, an empty set is subtracted so that the relation remains the same. The domain of this relation, i.e.,

(λ^P_0, λ^Q_0) → { S0(i) | 0 ≤ λ^P_0, λ^Q_0 ≤ i − 2 },

represents conflicting values of the parameters. In particular, it is impossible for the last executed iterations of P and Q to both be before the previous iteration of the loop. These impossible cases are removed from the sink parametrization S2.

7. EXTENSIONS

In this section, we briefly discuss two extensions: dynamic loop bounds, which are supported by our dataflow analysis implementation, and dynamic index expressions, which are currently not supported.

 1 for (int x1 = 0; x1 < n; ++x1) {
 2 S1:     s = f();
 3         for (int x2 = 0; P(x1, x2); ++x2) {
 4 S2:         s = g(s);
 5         }
 6 R:      h(s);
 7 }

Listing 3: C version of example E1 from [1, Section 3.2.2]

7.1 Dynamic Loop Conditions

In principle, dynamic loop conditions can be handled by introducing a filter that represents the last executed iteration of the loop. Consider, for example, the loop in Line 3 of Listing 3. The loop condition depends on some unknown function P applied to the loop iterators and is therefore not (locally) static affine. We could introduce a virtual scalar, say m, that represents the final iteration of the loop and that implicitly depends on x1. The filter value relation would then be of the form

(n) → { S2(x1, x2) → (m) | 0 ≤ x1 < n ∧ 0 ≤ x2 ≤ m }.

Unfortunately, the resulting dataflow dependence relations would not be practically useful for the construction of process networks since this last executed iteration is not known in advance.

Instead, we record the result of the dynamic loop condition in a virtual array and make the body of the loop depend on the value of the current and all previous iterations of the loop being 1. That is, the statement on Line 4 has filter value relation

V^{S2} = (n) → { S2(x1, x2) → (1) | 0 ≤ x1 < n ∧ x2 ≥ 0 }

with filter access relation

F^{S2} = (n) → { S2(x1, x2) → t0(x1, a) | 0 ≤ a ≤ x2 }.

Note that this filter access relation is not a function, but a multi-valued filter access relation and therefore requires the treatment of Appendix A. The statement evaluating the condition is made to depend on all previous iterations. Note that we currently only handle dynamic loop conditions for the purpose of dataflow analysis and not for the actual construction of process networks.
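By way of illustration, the filter just described corresponds to the following hand-written rewrite of Listing 3, in which the result of the dynamic condition is recorded in a virtual array before it is used (the names t0 and T are ours; this is a sketch of the intended semantics, not code produced by the implementation):

    for (int x1 = 0; x1 < n; ++x1) {
    S1: s = f();
        for (int x2 = 0; ; ++x2) {
    T:      t0[x1][x2] = (P(x1, x2) != 0);  /* record the dynamic loop condition */
            if (!t0[x1][x2])
                break;
    S2:     s = g(s);                       /* filtered on t0(x1, a) = 1 for all 0 <= a <= x2 */
        }
    R:  h(s);
    }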

7.2 Dynamic Index Expressions

Our implementation currently does not support dynamic index expressions. We would have to allow the access relations to be approximate as well and the parameters would change to mean that the iteration is executed and that the element being accessed is the same as that accessed by the sink. This change in meaning would limit the conclusions we could draw from the new parameters.

8. EXPERIMENTS

Our approach has been implemented in the da and pn tools of the isa prototype tool suite (git://repo.or.cz/isa.git). The da tool performs dataflow analysis and supports dynamic loop conditions, while the pn tool additionally constructs a process network and currently does not support dynamic loop conditions. Table 1 shows the results of a preliminary experimental comparison of our da tool against that of [6], which is an implementation of the approach that most closely resembles our own.


input    |       da        |    fadatool     |  fadatool -s
         | time    p    d  | time    p    l  | time    p    l
---------+-----------------+-----------------+----------------
Lst 1    | 0.01s   0    5  | 0.01s   6    6  | 0.01s   6    6
Lst 2    | 0.01s   4    9  | 0.01s   6   16  | incorrect
fuzz4    | 0.06s   3    9  | 0.02s   4    9  | 0.01s   0    9
for1     | 0.02s   2    3  | 0.01s   4   46  | 0.02s   2    3
for2     | 0.03s   2    3  | 0.09s  12   5k  | 0.04s   4    3
for3     | 0.04s   2    3  | 42s    24   1M  | 0.08s   6    3
for4     | 0.06s   2    3  |                 | 0.16s   8    3
for5     | 0.08s   2    3  |                 | 0.25s  10    3
for6     | 0.14s   2    3  |                 | 0.42s  12    3
c if1    | 0.02s   2    3  | 0.01s   2    4  | 0.01s   2    4
c if2    | 0.02s   2   10  | 0.02s   4   52  | 0.02s   2    8
c if3    | 0.03s   2   22  | 0.03s   6  723  | 0.36s   3   16
c if4    | 0.02s   2   10  | 0.17s   8   9k  | 1m      4   28
whil1    | 0.01s   0    4  | 0.00s   1    4  | 0.01s   0    4
whil2    | 0.03s   3    4  | 0.01s   5    6  | incorrect
if var   | 0.03s   4    3  | 0.01s   2    8  | 0.01s   2    4
if wh    | 0.04s   2   14  | 0.01s   5   58  | 0.02s   4   58
if2      | 0.02s   2    2  | 0.46s  12  29k  | 0.04s   4    2

Table 1: Experimental Results

In particular, we use version isa-0.11-319-gead5e27 of da and version fda6009 of fadatool. We use fadatool both with and without the -s option, since the results can be wildly different. It should be noted that both tested tools are prototypes. We should therefore be careful about drawing conclusions from these results, especially since fadatool -s sometimes produces incorrect results. Since the inputs in the table are relatively simple, correctness was determined through visual inspection.

For each tool and for each test case, we report the time taken by the analysis, the number of parameters introduced and the number of disjuncts in the dependence relations (for da) or the number of leaves in the quasts (for fadatool). Note that as explained in Section 9 below, the internal representation of dependence relations inside fadatool includes an additional parameter similar to our β. Since these internal parameters are not explicitly printed in the output of the tool, they are not included in the parameter count for fadatool. By contrast, the β parameters are included in the parameter count for da. The first two inputs are those of Listing 1 and Listing 2, modified to pass through the default fadatool parser. In particular, the parser does not support multiple writes in a single statement. It was therefore difficult to convert our more extensive test cases. The remaining cases (with some of the names abbreviated) come from the fadalib distribution. We omit those test cases that contain index expressions that are not static affine since we cannot handle them. Of note is that fadatool fails to recognize that no parameters need to be introduced on Listing 1 and that without the -s option, it quickly runs out of control on the for test cases. The cascade_if results may be somewhat misleading since pet recognizes that the filter variables used in the if conditions are the same and that some of the inner tests are implied by the outer test, greatly simplifying the input to da.

9. RELATED WORK

Despite its name, FADA is to the best of our knowledge the only alternative approach that allows for an exact (but run-time dependent) dataflow analysis in the presence of dynamic and/or non-affine conditions or index expressions. The main differences are that FADA introduces different parameters (with a different meaning), that they are only analyzed after all parameters have been introduced and that the analysis is performed using resolution on general first order logic formulas. A new vector of parameters α is introduced for every maximization problem similar to (4), meaning that potentially many more parameters are introduced. The absence of a solution (similar to our β = 0) is represented as ⊥ on paper and is reported to be represented by a scalar variable similar to our β inside the implementation of [6]. Note that it is sometimes suggested [8, 22] to use a value outside the iteration domain, but this may not always be easy to determine and is impossible for 0D iteration domains. Our dataflow analysis on filter arrays is a special case of the iterative approach of [1].

Other approaches to dataflow analysis [16, 21] produce approximate results in the presence of constructs that are not static affine. The approach of [16] in particular propagates values to discover static affine constraints in constructs that do not at first appear to be static affine. The authors of [9] propose an algorithm for computing reaching definitions for arrays that applies to both structured and unstructured programs. However, they only focus on how to collect constraints and do not explain how to solve them. Instead, they introduce uninterpreted functions and rely on Omega [19] for solving formulas containing such uninterpreted functions. Unfortunately, the support in Omega for uninterpreted functions is very limited and cannot handle the constraints they collect. The iegenlib library [23] has more extensive support for uninterpreted functions, but does not support a difference operation and can therefore not be used to perform value-based dependence analysis.

10. CONCLUSIONS AND FUTURE WORK

We have presented a novel approach for exact array dataflow analysis in the presence of constructs that are not static affine. Dynamic behavior in the input program is represented using filters. An analysis of these filters determines if the dependences are also run-time dependent. If so, parameters are introduced to represent this run-time dependence, where we are careful to introduce as few parameters as possible. This is made possible by a judicious definition of these parameters. We plan on working on a closer integration with pet so that we can perform the dependence analysis incrementally, allowing us to locally treat some variables in the input as symbolic constants (as advocated by [15]) and more easily detect some cases (such as that of Listing 1) where no parameters need to be introduced. Note that the techniques developed in this paper would still be useful on more complicated inputs.

11. ACKNOWLEDGMENTS

This work was partially funded by a gift received by LIACS from Intel Corporation and by the European Commission through the FP7 project CARP id. 287767.

12. REFERENCES

[1] D. Barthou. Array Dataflow Analysis in Presence of Non-affine Constraints. PhD thesis, PRiSM - Laboratoire de recherche en informatique, Feb. 1998.
[2] D. Barthou, A. Cohen, and J.-F. Collard. Maximal static expansion. Int. J. Parallel Program., 28(3):213–243, 2000.
[3] D. Barthou, J.-F. Collard, and P. Feautrier. Applications of fuzzy array dataflow analysis. In Euro-Par Conference, volume 1123 of Lect. Notes in Computer Science, pages 424–427, Lyon, Aug. 1996. Springer-Verlag.
[4] D. Barthou, J.-F. Collard, and P. Feautrier. Fuzzy array dataflow analysis. J. Parallel Distrib. Comput., 40(2):210–226, 1997.
[5] D. Barthou, P. Feautrier, and X. Redon. On the equivalence of two systems of affine recurrence equations. In Euro-Par Conference, volume 2400 of Lect. Notes in Computer Science, pages 309–313, Paderborn, Aug. 2002. Springer-Verlag.
[6] M. Belaoucha, D. Barthou, A. Eliche, and S.-A.-A. Touati. FADAlib: an open source C++ library for fuzzy array dataflow analysis. In Intl. Workshop on Practical Aspects of High-Level Parallel Programming, volume 1, pages 2075–2084, Amsterdam, The Netherlands, May 2010.
[7] T. Bijlsma. Automatic parallelization of nested loop programs for non-manifest real-time stream processing applications. PhD thesis, University of Twente, 2011.
[8] J.-F. Collard, D. Barthou, and P. Feautrier. Fuzzy array dataflow analysis. In Proceedings of the 5th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, July 1995.
[9] J.-F. Collard and M. Griebl. A precise fixpoint reaching definition analysis for arrays. In Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing, LCPC '99, pages 286–302, London, UK, 2000. Springer-Verlag.
[10] P. Feautrier. Array expansion. In ICS '88: Proceedings of the 2nd International Conference on Supercomputing, pages 429–441. ACM Press, 1988.
[11] P. Feautrier. Parametric integer programming. RAIRO Recherche Operationnelle, 22(3):243–268, 1988.
[12] P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23–53, 1991.
[13] P. Feautrier. Some efficient solutions to the affine scheduling problem. Part I. One-dimensional time. International Journal of Parallel Programming, 21(5):313–348, Oct. 1992.
[14] P. Feautrier. The Data Parallel Programming Model, volume 1132 of LNCS, chapter Automatic Parallelization in the Polytope Model, pages 79–100. Springer-Verlag, 1996.
[15] V. Maslov. Lazy array data-flow dependence analysis. In H.-J. Boehm, B. Lang, and D. M. Yellin, editors, POPL, pages 311–325. ACM Press, 1994.
[16] V. Maslov. Enhancing array dataflow dependence analysis with on-demand global value propagation. In ICS '95: Proceedings of the 9th International Conference on Supercomputing, pages 265–269, New York, NY, USA, 1995. ACM.
[17] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere. Daedalus: toward composable multimedia MP-SoC design. In Proceedings of the 45th Annual Design Automation Conference, DAC '08, pages 574–579, New York, NY, USA, 2008. ACM.
[18] S. Pellegrini, T. Hoefler, and T. Fahringer. Exact dependence analysis for increased communication overlap. Recent Advances in the Message Passing Interface, pages 89–99, 2012.
[19] W. Pugh and D. Wonnacott. Going beyond integer programming with the Omega test to eliminate false data dependences. Technical Report CS-TR-3191, Department of Computer Science, University of Maryland, College Park, Maryland, Dec. 1992. An earlier version of this paper appeared at the ACM SIGPLAN '92 Conference on PLDI.
[20] W. Pugh and D. Wonnacott. An exact method for analysis of value-based array data dependences. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 546–566. Springer-Verlag, 1994.
[21] W. Pugh and D. Wonnacott. Nonlinear array dependence analysis. Technical report, College Park, MD, USA, 1994.
[22] T. Stefanov. Converting Weakly Dynamic Programs to Equivalent Process Network Specifications. PhD thesis, Leiden University, Leiden, The Netherlands, Sept. 2004.
[23] M. M. Strout, G. George, and C. Olschanowsky. Set and relation manipulation for the sparse polyhedral framework. In Proceedings of the 25th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Sept. 2012.
[24] A. Turjan, B. Kienhuis, and E. Deprettere. Translating affine nested-loop programs to process networks. In CASES '04: Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 220–229, New York, NY, USA, 2004. ACM Press.
[25] S. Verdoolaege. isl: An integer set library for the polyhedral model. In K. Fukuda, J. Hoeven, M. Joswig, and N. Takayama, editors, Mathematical Software - ICMS 2010, volume 6327 of Lecture Notes in Computer Science, pages 299–302. Springer, 2010.
[26] S. Verdoolaege and T. Grosser. Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT'12), Paris, France, Jan. 2012.
[27] S. Verdoolaege, G. Janssens, and M. Bruynooghe. Equivalence checking of static affine programs using widening to handle recurrences. In Computer Aided Verification 21, pages 599–613. Springer, June 2009.
[28] S. Verdoolaege, H. Nikolov, and T. Stefanov. pn: A tool for improved derivation of process networks. EURASIP Journal on Embedded Systems, special issue on Embedded Digital Signal Processing Systems, 2007, 2007.


APPENDIX

A. MULTI-VALUED FILTER ACCESS RELATIONS

In this appendix, we describe how to extend the representation and manipulation of filters to handle multi-valued filter access relations.

A.1 Filters

In Section 4, we assumed that all filter access relations access a single element in each iteration. Here, we describe how to extend the definitions to handle multi-valued filter access relations. The trickiest part is not so much handling filter access relations that access more than one data element, but filter access relations that may access zero data elements for some iterations. We therefore allow several filters on the same iteration domain, with disjoint domains, and impose that in each filter the domain of the filter value relation is a subset of the domains of all the filter access relations. In particular, each statement S has μ_S filters on its iteration domain, with μ_S ≥ 0. Each of the filters F_i, with 1 ≤ i ≤ μ_S, is represented by a sequence of filter access relations F^S_{i,j} with 1 ≤ j ≤ n^S_i and n^S_i the number of filter access relations, and a filter value relation V^S_i. As in Section 4, we have F^S_{i,j} ⊆ I_S → (I_S → A) and V^S_i ⊆ I_S → Z^{n^S_i}. Furthermore, we impose

dom F^S_{i,j} ⊇ dom V^S_i,

for all 1 ≤ i ≤ μ_S and 1 ≤ j ≤ n^S_i, to ensure that all the filter access relations are total on the domains of the corresponding filter value relations, and

dom V^S_{i1} ∩ dom V^S_{i2} = ∅

for all 1 ≤ i1, i2 ≤ μ_S such that i1 ≠ i2, to ensure that the domains of the filter value relations are disjoint.

The definition of executed_S(k) (5) is then replaced by

∀(f_1, . . . , f_{n^S_i}) ∈ ∏_{j=1}^{n^S_i} F^S_{i,j}(k) : (V(f_j))_{j=1}^{n^S_i} ∈ V^S_i(k)    (19)

if k ∈ dom V^S_i, and is simply true on those parts of the iteration domain I_S that do not intersect any of the dom V^S_i. That is, an element of the iteration domain I_S is only executed if the tuples of values of all the tuples of accessed filter array elements satisfy the active filter value relation.

A.2 Filter Values implied by the SinkIn this section, we describe the derivation illustrated in

Section 6.3.1. We will assume that only a single sink filterand a single potential source filter apply to the domain andrange of M . That is, we will assume that domM1 ⊆ domV C

and ranM1 ⊆ domV P , with M1 the result of projecting outthe array elements from M as in Section 6.3.1. In general,M needs to be split up according to the filters.

Let the sink filter consist of n filter access relations Fi

and the potential source filter of m filter access relationsGj . Since we are going to compare the filters, we replacethe filter access relations by their sources, as explained atthe end of Section 4. In particular, Fi ⊆ IC → (I → A) andGi ⊆ IP → (I → A). We construct a relation H betweentheir possible values, initialized as

H = Zn → Zm.

Let Bj represents the bounds on the values of the arrayelements accessed by Gj as specified by the user through apragma value_bounds [26, Section 2], or Z if no such boundshave been specified. Let B be the Cartesian product of thesebounds, i.e.,

B =

m∏

j=1

Bj . (20)

The relation can then be refined to

H = Zn → B.

In the next step, we iterate over the potential source fil-ter access relations and check if we have any informationabout them in the sink filter value. In particular, we checkif any of the sink filter access relations accesses “the same”value(s). If so, we equate the corresponding dimensions inthe mapping between filter values. To check if “the same”value(s) are accessed, we pull back the filter access relationover M1. This results in a relation between sink domainiterations and filter sources such that there is at least onepotential source corresponding to the sink domain iterationthat accesses that filter source. If this relation is a subsetof the filter access relation at the sink, then we know thateverything we know about the values of the filter accesses atthe sink also applies to the values of the corresponding fil-ter accesses at the corresponding potential source iterations.Note that there may be more than one potential source it-eration associated to a given sink iteration and that each ofthese potential source iteration may access a different ele-ment from the filter array. The above process ensures thatsource and sink values are only equated if all of these filterarray elements are covered by the sink filter access relation.

In particular, for any k ∈ domV C that is executed, weknow from (19),

∀(f1, . . . , fn) ∈n∏

j=1

FCj (k) : (V (f j))

nj=1 ∈ V

C(k).

If we haveGPj ◦M1 ⊆ FC

i , i.e., ∀t ∈M1(k) : GPj (t) ⊆ FC

i (k),then we know

∀t ∈M1(k) :∀(f1, . . . , fm) ∈n∏

j=1

GPj (t) : (V (f j))

nj=1 ∈ V1(k),

(21)with V1 = H ◦ V C . The relation H takes care of projectingout those dimensions in V C for which we were unable to finda corresponding filter access GP

j , reordering those for whichwe did find a correspondence and introducing dimensions forthose filter accesses GP

j for which we were unable to find a

corresponding filter access FCi . The relation V1 ⊆ IC → Zm

represents what we know about the filter values at the po-tential sources associated to a given sink domain iteration,given that the sink domain iteration is executed. A furthercomposition with (the inverse of) M1 ⊆ IC → IP , yields asubset of IP → Zm. This relation maps potential source iter-ations to filter values that allow for one or more correspond-ing sink iterations to be executed. In particular, if there isan element k ∈M−1

1 (t) that is executed, then (V(f j))mj=1 is

an element of V1(k). Given that there is such a k, the tu-ple (V(f j))

mj=1 is therefore an element of the union of V1(k′)

over all k′ ∈ M−11 (t). That is, (V(f j))

mj=1 is an element of

(V1 ◦M−11 )(t). In other words, for values of the filters out-

side the relation V1 ◦M−11 , no corresponding (according to


Note that we cannot take the intersection of V1(k′) over all k′ because (21) only applies to those k that are executed and we only know that at least one of the k′ ∈ M1^{-1}(t) is executed, not that all of them are executed.

A.3 Filter Values implied by Other SourcesIn this section, we describe the derivation illustrated in

Section 6.3.2. In particular, we describe the “update” func-tion in Line 8 of Algorithm 2. In particular, during the com-putation of the partial lexicographical maximum of M on U(4), the set U may already involve parameters that refer tothe last iterations of (other) potential sources. We describehow we can exploit the constraints on these parameters toderive extra information about the filter values at the sink.The procedure of Section A.2 then needs to be applied tothe updated sink filter to actually derive information aboutthe filter values at the original potential source.

Specifically, we will derive information from the fact thatsome potential source iterations have not been executed.The conditions on the filter values at the potential sourcethat allow the sink to be executed, but not the potentialsource are mapped back to the sink, first by taking the in-tersection over all potential source iterations associated toa given sink iteration and filter element and then by takingthe union over all accessed filter elements. We take the in-tersection over the potential source iterations since we knowthat all those iterations are not executed and so all of thecorresponding constraints apply. We take the union over theaccessed filter elements, since we need to obtain constraintsthat are valid for all of those elements. As in Section A.2, weassume that only a single sink filter and a single potentialsource filter applies to the domain and range of M , i.e., thatdomM1 ⊆ domV C and ranM1 ⊆ domV P . As before, M1

is the result of projecting out the array elements from M .Let us similarly define a U1 that is the result of projectingout the access array element from U . In the remainder ofthis section, we will take Dmem

C,Q to have the array elementsprojected out, i.e., Dmem

C,Q ⊆ IC → IQ.Let us now look at the construction in more detail. We

start with the construction of a mapping from sink iter-ations to iterations of some access Q that have not beenexecuted (according to information in U). We first applythe parametrization of (15) to the iteration domain of Qand construct a relation N0 ⊆ U1 → I ′Q (with I ′Q the re-sult of the parametrization) that is universal, except thatthe first ` dimensions in domain and range are equated.Note that some of the extra parameters in I ′Q also appear inU1 (otherwise we would not consider this potential source).This means that the relation N0 relates sink iterations topotential source iterations that include the last potentialsource iteration executed before the sink iteration. Thatis, {k → λQ

C(k) | k ∈ U1 ∧ σQC (k) ≥ ` } is a subset of

N0. In particular, according to (14), the last element ofDmem

C,Q (k), with k ∈ U1, that shares the first ` iterators andwhere the filter values satisfy the filtered iteration domainof Q is included in this relation. The relation may also con-tain additional elements since we may not have introduceda parameter for each dimension as explained in Section 5.Projecting out all parameters introduced in (15) (for anyiteration domain), we obtain a relation between sink iter-ations and potential source iterations that include the lastpotential source iteration for any value of the other param-

eters. Further combining this relation with a relation map-ping potential source iterations to earlier potential sourceiterations { SQ(i) → SQ(i′) | i < i′ } results in a relationbetween sink iterations and potential source iterations thatmay have executed. In particular, the potential source it-erations that are not related to a given sink iteration (andthat share the first ` iterators) are definitely not executed.To obtain this relation between sink iterations and potentialsource iterations that are definitely not executed we subtractthe relation computed above from the corresponding mem-ory based dependences Dmem

C,Q (with the first ` dimensionsequated). Let us call the resulting relation N ⊆ IC → IQ.Due to the construction, we have for each k → j ∈ N that¬executedSQ(j).

If N is empty, then we cannot use it to derive any in-formation and the computation stops. Otherwise, the firststep in our derivation is to apply the computation of Sec-tion A.2 to N . This assumes that domN ⊆ domV C andranN ⊆ domV Q, for some filter of Q. The first condi-tion can be enforced by intersecting N with domM → IQ,since we are only interested in sink iterations that belongto domM in any case. If we cannot find a filter on Q suchthat the second condition holds, then the computation stops.During the course of the computation, we will remove filteraccess relations GQ

j that are not single-valued. We thereforecheck if the filter on Q has any single-valued filter accessrelations. If not, the computation stops.

Applying the computation in Section A.2 (with M re-placed by N and the potential source P by the other po-

tential source Q), we obtain a relation V0 ⊆ IQ → Zm′such

that for each k→ j ∈ N , we have

∀(f1, . . . , fm′) ∈m′∏

i=1

GQi (j) : (V(f i))

m′i=1 ∈ V0(j).

For the same j, since ¬executedSQ(j), we also know

∃(f1, . . . , fm′) ∈m′∏

i=1

GQi (j) : (V(f i))

m′i=1 6∈ V

Q(j).

Combining these results, we have

∃(f1, . . . , fm′) ∈m′∏

i=1

GQi (j) : (V(f i))

m′i=1 ∈

(V0 \ V Q

)(j).

Unfortunately, knowing that there is some sequence of filterelements f i will not allow us to derive any further informa-tion. We will therefore assume that that all filter accessrelations GQ

i are single-valued. Effectively, this means thatwe remove those filter access relations that are not single-valued and project out the corresponding dimensions fromV0 \V Q. Let V1 be the result of this projection. That is, foreach k→ j ∈ N ,

(V(GQ

i (j)))m′′

i=1∈ V1(j).

Let G be the range product of the single-valued filter accessrelations GQ

i .At this point we have a relation N ⊆ IC → IQ map-

ping sink iterations to corresponding non-executed potential

source iterations, a relation G ⊆ IQ → (I → A)m′′

, map-ping potential source iterations to (single) filter elements,

and a relation V1 ⊆ IQ → Zm′′, mapping potential source


iterations to corresponding filter values that do not allowthe potential source iteration to be executed, but do allowthe corresponding sink iteration to be executed. We firstcombine N and G into a single relation

N1 = N oG−1

The result is a subset of (IC → (I → A)m′′

) → IQ. Wehave, for each (k→ (f)i)→ j ∈ N1,

(V(f i))m′′i=1 ∈ V1(j).

Note that because they represent (part of) a filter, we havedomG ⊇ domV Q. Combined with our assumption thatranN ⊆ domV Q, this ensures that we do not remove anyelements from the range of N . We now connect the pairs ofsink iterations and filter elements to the possible values ofthose filter elements at the corresponding potential sourceiterations by computing

V2 : domN1 → Zm′′: V2(k→ f) =

j∈N1(k→f)

V1(j). (22)

That is, we consider the potential source iterations j thathave definitely not been executed before a certain sink iter-ation k and compute the intersection of the possible valuesof the filter elements f over all those definitely not executedsource iterations. We have, for each k→ (f)i ∈ domV2,

(V(f i))m′′i=1 ∈ V2(k→ (f)i).

Note that different potential source iterations associated tothe same sink iteration may access different elements fromthe filter arrays. We therefore need to be careful to onlycombine constraints on values associated to the same ele-ments. If N1 is single-valued, the intersection in (22) iscomputed over a single element and so we can simply com-pute V2 as

V2 = V1 ◦N1.

Otherwise, it is computed as

V2 = N1 v V −11 ,

with v the non-empty subset operation on two relations,which constructs a relation between the domain elements ofthe two relations such that the image of the first domain ele-ment is a subset of the image of the second domain element.That is, R1 v R2 is equal to

{ s→ t | s ∈ domR1 ∧ t ∈ domR2 ∧R1(s) ⊆ R2(t) }

The constraint R1(s) ⊆ R2(t) can be expressed as

∀u : s→ u ∈ R1 ⇒ t→ u ∈ R2

or

¬∃u : s→ u ∈ R1 ∧ t→ u 6∈ R2

where u may be restricted to ranR1. The operation cantherefore be computed as

R1 v R2 = R−12 ◦R1 \

(((domR2 → ranR1) \R2)−1 ◦R1

).

Finally, we project out filter elements and compute

V3 : IC → Zm′′: V3(k) =

f∈(domV2)(k)

V2(k→ f).

The resulting relation contains all values of all filter elementsread by any non-executed iteration associated to a certainsink iteration. That is, for every k ∈ domV3 = domN ,

∀(f1, . . . , fm′′) ∈m′′∏

i=1

GQi (N(k))) : (V(f i))

m′′i=1 ∈ V3(k).

V3 can be computed as

V3 = V2 ◦ (dom−−→(W−1(domV2)))−1,

with W−1S extracting the nested relation from the set S,i.e.,

W−1S = { s→ t | (s→ t) ∈ S }.The relation V3 may only apply to a subset of the domain ofM1. Since we do not have any information about elementsoutside domV3 = domN , we extend V3 to the entire domainof M1 as

V4 = V3 ∪(

(domM1 \ domN)→ Zm′′).

We now want want to intersect V C with V4, but as inSection A.2 we first need to align the filter access relations,by constructing a relation H mapping the filter values of V4

to those of V C . In this case, however, we also allow extrafilter access relations to be added, both to the original setof filter access relations of the sink and to the filter accessrelations associated to V4. We do this to be able to collect asmuch information as possible at the sink. In particular, someof the sources may involve the same filter access relationseven if these filter access relations do not originally appearamong those of the sink and we still want to combine theinformation from different sources at the sink.

More specifically, we look for filter access relations FCi

that are identical to some GQj ◦N . If we cannot find such an

FCi , we add GQ

j ◦N to the sink filter access relations (adjust-

ing V C). In both cases, we express the correspondence inH. Additionally, we look for filter access relations FC

i thatform a (strict) subset of some GQ

j ◦N . If so, we add FCi to

the filter access relations associated to V4, adjusting V4 byduplicating the constraints on the corresponding dimensionto the new dimension that corresponds to the extra filteraccess relation. Again, we express the correspondence in H.At the end, we apply H to V4 and intersect V C with theresult.

A.4 Avoid InconsistenciesAs explained in Section 6.4, some values of the parame-

ters expressing the last iteration of potential source P , in-troduced in Section 6.2 and constrained by (18), may notbe consistent with the fact that the sink C is executed orwith the values of parameters expressing the last iteration ofother potential sources introduced before. In this section, wedescribe how we can remove some of these inconsistencies.Note that leaving in these inconsistencies does not lead toincorrect results, but only to less accurate results. We there-fore do not need to remove every possible inconsistency, butinstead try to remove those that we can easily discover.

Let us start with inconsistencies that arise from the fact that two potential sources, the current one (P) and one that was considered before (Q), are executed. In particular, assume that iteration i of P is executed and iteration j of Q


as well. According to (19), we then have

∀(f^1, . . . , f^m) ∈ ∏_{j=1}^{m} F^P_j(i) : (V(f^j))_{j=1}^{m} ∈ V^P(i)   (23)

for some filter of P and

∀(f^1, . . . , f^{m'}) ∈ ∏_{j=1}^{m'} F^Q_j(j) : (V(f^j))_{j=1}^{m'} ∈ V^Q(j)   (24)

for some filter of Q. If we can find a pair of subsequences of filter access relations that access some sequence of filter array elements in common, then the corresponding dimensions of V^P(i) and V^Q(j) need to have some value in common. Let K ⊆ IP × IQ contain those pairs of iterations that access the same filter array elements and let H ⊆ Z^m × Z^{m'} express the correspondence of values. That is, H is expressed as one or more equalities equating pairs of dimensions, one from V^P(i) and one from V^Q(j). The relation

T = W^{-1}((V^P × V^Q)^{-1}(W H)),   (25)

where

W R = { (s → t) | s → t ∈ R },

then contains those pairs of iterations that allow for a common value. Removing those from K, we obtain the relation

K′ = K \ T

of pairs of iterations that should allow for a common value, but in fact do not. To map these inconsistencies to the parameters we construct a relation D1 from D^mem_{C,P} (with array elements projected out). In particular, we intersect the domain of D^mem_{C,P} with the parametrization of (15) and equate the first ℓ dimensions of domain and range. We similarly construct a relation D2 from D^mem_{C,Q}. The inconsistent values of the parameters are then obtained as

(D1 o D2)(W K′)   (26)

and the result is removed from the sink parametrization.

The above procedure may be repeated for any appropriate pair of K and H. In our implementation, we currently only construct a single such pair in a greedy way. In particular, we start out with a universal K and H and consider pairs of filter access relations F^P_i ⊆ IP → (I → A) and F^Q_j ⊆ IQ → (I → A) that access the same array. For each such pair, we compute

K_{i,j} = (F^Q_j)^{-1} ◦ F^P_i ⊆ IP → IQ   (27)

and check if K_{i,j} has a non-empty intersection with the current value of K. If so, we replace K with this intersection and adjust H accordingly. Otherwise, we skip this pair of filter access relations.

We have only considered inconsistencies based on pairs of potential sources that are executed. It is also possible to derive inconsistencies based on one or both of the potential sources not being executed. Let us first consider the case where P is not executed and Q is executed. In this case, (23) is replaced by

∃(f^1, . . . , f^m) ∈ ∏_{j=1}^{m} F^P_j(i) : (V(f^j))_{j=1}^{m} ∉ V^P(i).

Whereas in the case of (23) and (24) a conflict occurs as soon as there is any sequence of accessed elements for which no matching value can be found, in this case we only know something about some sequence of accessed elements at P and we can therefore only arrive at a conflict if all of the accessed elements are also accessed by Q. In particular, (27) needs to be replaced by

K_{i,j} = F^{LP}_i ⊑ F^{LQ}_j ⊆ I_{LP} → I_{LQ}.

Additionally, V^P in (25) should be replaced by

(I_{LP} → B) \ V^{LP},

with B the bounds on the possible values of the array elements, as computed in (20), while D1 in (26) should refer to iterations that are not executed. That is, rather than intersecting D^mem_{C,P} with the parametrization of (15), we first map this parametrization to lexicographically later elements and take the union with the set

(σP) → { SSP(j) | σP < ℓ }.

This second part accounts for the fact that when σP < ℓ, none of the iterations that share the first ℓ iterators have executed.

The case where P is executed and Q is not executed is handled similarly. For the case where both P and Q are not executed, we can only draw any conclusion for those iterations that access a single filter element. That is, K_{i,j} of (27) is replaced by

{ i → j ∈ (dom F^P_i) → (dom F^Q_j) | ∀f1, f2 ∈ F^P_i(i), f3, f4 ∈ F^Q_j(j) : f1 = f2 = f3 = f4 }.

This relation can be computed by removing

dom((F^P_i n F^P_i) \ (IP → 1_{I→A}))

from the domain of (F^Q_j)^{-1} ◦ F^P_i and

dom((F^Q_j n F^Q_j) \ (IQ → 1_{I→A}))

from its range, with

R1 n R2 = { s → (t → u) | s → t ∈ R1 ∧ s → u ∈ R2 }.

That is, we consider pairs of image elements associated to the same domain element of F^P_i or F^Q_j and remove pairs of identical image elements. That only leaves pairs of different image elements, and the domain of the relation represents those domain elements that have multiple image elements associated to them.

Inconsistencies between the potential source P and the sink C are removed in a similar way, except that C is always executed so that we only need to consider two cases, one where P is executed and one where P is not executed. Furthermore, the relation D2 in (26) is replaced by the identity relation on the iteration domain of the sink, i.e., 1_{IC}.


Multifor for Multicore

Imèn Fassi
Dpt of Computer Science
Faculty of Sciences of Tunis
University El Manar
1060 Tunis, Tunisia
[email protected]

Philippe Clauss
Team CAMUS, INRIA
University of Strasbourg
boulevard S. Brant
67400 Illkirch, France
[email protected]

Matthieu Kuhn
Team ICPS, LSIIT lab.
University of Strasbourg
boulevard S. Brant
67400 Illkirch, France
[email protected]

Yosr Slama
Dpt of Computer Science
Faculty of Sciences of Tunis
University El Manar
1060 Tunis, Tunisia
[email protected]

ABSTRACT
We propose a new programming control structure called "multifor", allowing to take advantage of parallelization models that were not naturally attainable with the polytope model before. In a multifor-loop, several loops whose bodies are run simultaneously can be defined. Respective iteration domains are mapped onto each other according to a run frequency (the grain) and a relative position (the offset). Execution models like dataflow, stencil computations or MapReduce can be represented on one referential iteration domain, while still exhibiting traditional polyhedral code analysis and transformation opportunities. Moreover, this construct provides ways to naturally exploit hybrid parallelization models, thus significantly improving parallelization opportunities in the multicore era. Traditional polyhedral software tools are used to generate the corresponding code. Additionally, a promising perspective related to non-linear mapping of iteration spaces is also presented, making it possible to run a loop nest inside any other one by solving the problem of inverting "ranking Ehrhart polynomials".

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors

General Terms
Performance

Keywords
programming control structure, parallel programming, polytope model

1. INTRODUCTION
We have definitely entered a new era in programming. Parallelism is everywhere, from the many-core processor architectures that now equip mainstream computers, to the software applications that mix intensive computations, specialized routines, network communication and multi-threading. Much research focuses on proposing new languages supposed to facilitate programming inside this complex environment [8, 10, 11, 13], or on proposing hardware or software support like transactional memory systems [6, 14, 15], supposed to prevent incorrect computations while still providing good performance. However, all these proposals face intractable issues. Most proposed languages require drastic changes to programmers' habits and have weak chances of being adopted by the software industry [4]. Moreover, even if they offer interesting constructions to express parallel tasks, they do not solve the fundamental problem of correct and efficient parallelism extraction, which involves dependence analysis, data locality optimization, task grain adjustment, etc. Performance also depends strongly on their implementation, i.e., on their compilers or runtime systems. On the other hand, hardware or software support does not hide parallelization complexity from the user either. And overall, these mechanisms are of high complexity by themselves, which most often makes them unrealistic for real usage [3].

Nevertheless, parallel programming already has a long history, in which gradual extensions have been proposed. Some of them were quite successful and are still current. For example, directive-based languages, such as OpenMP [2], are extensions to mainstream languages. The use of their instructions is not mandatory when inserted in a source code, and they can be discovered and adopted progressively by any developer. They do not break programming habits, while offering efficient parallelization opportunities. Although they do not solve the fundamental complexity of parallelization either, they nicely open the door of high performance computing to anyone.

At the same time, a lot of relevant transformation techniques have been discovered in order to exhibit parallelism or to optimize code, particularly on loops, such as software pipelining, loop unrolling, tiling, skewing, etc. [1, 16, 12]. These are applied either by experienced programmers, or automatically by compilers or optimization tools. In the parallelism era, compilers and runtime systems have to accelerate their progress in automatic parallelization, but at the same time, programmers have to be brought to become, at least, what we called "experienced programmers" ten or twenty years ago.

We argue that a good way to achieve such an emancipation to parallel programming is to gradually extend mainstream languages with new programming control structures


that are derived from already existing ones. Our idea is that many well-known optimizing and parallelizing code transformations should now be applied naturally by developers, but only in their minds, while using a control structure translating their enriched algorithmic reasoning. The existence of such control structures will condition them to enlarge their way of reasoning while programming. In the same way that it is currently natural to express the repetition of code blocks by using loops, or to abstract parametrized code by using functions, it should now be natural to bring closer instructions that use the same operands, or to arrange code snippets in vertical and horizontal directions to express simultaneity, sequencing and overlapping.

Following this idea, we propose a new control structure called Multifor, which can be seen as an extension of for-loops to multiple for-loops running at the same time, whose respective instructions are run sequentially inside one loop body as in a traditional for-loop, but run in any interleaved order, or in parallel, between the bodies. In addition to traditional parameters such as indices, bounds and steps, we propose to introduce a grain and an offset, allowing one to mix loops of different execution frequencies and of different starting positions. Such a programming construction translates naturally to code transformations such as loop fusion, loop pipelining or loop partitioning. Moreover, it facilitates many code optimizations such as data locality improvement or parallelization. It can be seen as an extension of for-loops from "vertical" to "horizontal" programming.

The second motivation of this proposal is related to the parallel programming models that are covered by the polytope model. Traditional application of this model does not allow one to naturally express parallel programming models such as task parallelism, dataflow or MapReduce. We show that the multifor construct allows loop nest processes to be scheduled by mapping together their respective iteration domains. A multifor code can be represented geometrically by a particular union of polyhedra, each being previously dilated, compressed or translated, either entirely or partially, according to transformation factors defined by constants or affine functions.

This new programming structure implies interesting implementation challenges for a compiler, from its front-end to its back-end. We show that, as is already the case with for-loops, the polytope model is quite well adapted to analyze, optimize and translate multifor constructs into efficient code.

Finally, as a promising perspective, we also propose a non-linear mapping of the iteration domains guided by the ranks of the iterations. This approach opens the possibility of mapping any iteration domain onto any other domain without being constrained by their shapes. It leads to solving the general problem of executing any loop nest by any other loop nest of the same trip count. Mathematically speaking, the general solution to this problem is based on inverting "ranking Ehrhart polynomials" [5, 9].

The paper is organized as follows. In the next section, the syntax and semantics of multifor loops are described, illustrated with a few examples of multifor headers and graphical representations. We also highlight the code parallelization and transformation schemes that are possible with multifor-loops. In Section 3, we discuss the main issues related to the implementation of multifor constructs and their corresponding code generation. Several real and representative code examples are presented in Section 4, highlighting the interesting contributions of this new control structure. The promising perspective of non-linear iteration space mapping is the topic of Section 5, where a solution for inverting ranking Ehrhart polynomials is proposed. Finally, conclusions and further perspectives are given in Section 6.

2. SYNTAX AND SEMANTICS
In this paper, we describe the initial syntax and semantics of the multifor construct. However, they can be extended in many ways in the future. We first present the case of a single multifor construct, as nested multifor-loops present some specificities which are described afterwards.

2.1 Non-nested multifor-loops
The multifor syntax is defined by:

multifor ( index1 = expr, [index2 = expr, ...] ;
           index1 < expr, [index2 < expr, ...] ;
           index1 += cst, [index2 += cst, ...] ;
           grain1, [grain2, ...] ;
           offset1, [offset2, ...] ) {
  prefix : {statements}
}

where [...] denotes optional arguments, indexi denotes the indices of the loops composing the multifor, expr denotes affine arithmetic expressions of enclosing loop indices, or constants, cst denotes an integer constant, grain and offset are positive integers, grain ≥ 1, offset ≥ 0, and prefix is a positive integer associating each statement to a given for-loop composing the multifor-loop, according to the order in which they are defined (0 for the first loop, 1 for the second loop, etc.). Without loss of generality, we consider in the following that the index steps, cst, always equal one, since the general case can easily be deduced.

Each for-loop composing the multifor behaves as a traditional for-loop, but all are mapped on a same global "virtual referential" domain, which can also be seen as a template. The way iterations of the for-loops are mapped is defined by their respective offset and grain. The grain defines the frequency at which the associated loop has to run, relative to the referential. For instance, if the grain equals 2, then one iteration of the associated loop will run over 2 iterations of the referential. The offset defines the gap between the first iteration of the referential and the first iteration of the associated loop. For instance, if the offset equals 3, then the first iteration of the associated loop will run at the fourth iteration of the referential loop.

The size and shape of the referential is deduced from the for-loops composing the multifor-loop. Geometrically, it is defined as the disjoint union of the for-loop domains, where each domain has been previously shifted according to its offset and dilated according to its grain. The disjoint union is the union of adjacent convex domains, each being scanned by a referential for-loop. The relative positions of the iterations of the for-loops composing the multifor-loop inside the referential depend on the overlapping of their respective domains. It means that on domains where only one for-loop's iterations are run, the grain becomes a compression factor. In general, the greatest common divisor of the grains of all the for-loops overlapping on a same referential domain


is used as the factor for compressing the points of this referential domain, according to the lexicographic order. On domains where several for-loops' iterations are run, these are run in an interleaved fashion, or simultaneously.

Let us illustrate this definition with a few examples. Consider the following multifor-loop header:

multifor (i1 = 0, i2 = 10; i1 < 10, i2 < 15; i1++, i2++; 1, 1; 0, 2)

In this example, the offset of index i1 is zero, and the one of index i2 is 2. Thus, the first iteration of the i1-loop will run immediately, while the first iteration of the i2-loop will run at the 3rd iteration of the multifor, but with i2 = 0. This behavior is illustrated by the figure below:

[Figure: iterations of the i1-loop and the i2-loop mapped onto the referential axis i; the i2-loop starts at referential iteration 2.]

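As an illustration, a hypothetical guard-based expansion of this first multifor header into plain C, following the code-generation scheme described later in Section 3.2, could look as follows (body0 and body1 stand for the statement blocks with prefixes 0 and 1; they are placeholders, not code from the paper):

for (i = 0; i < 10; i++) {          /* referential loop */
  if (i < 10)                       /* for-loop 0: grain 1, offset 0 */
    body0(/* i1 = */ i);
  if (i >= 2 && i < 2 + 5)          /* for-loop 1: grain 1, offset 2 */
    body1(/* i2 = */ 10 + (i - 2));
}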
Notice that the index values have no effect on the relative positions of the for-loop bodies, which are uniquely determined by the grain and the offset. Another example is:

multifor (i1 = 0, i2 = 10; i1 < 10, i2 < 15; i1++, i2++; 1, 4; 0, 0)

Now, the i1-grain is 1 and the i2-grain is 4. In such a case, for one iteration of the i2-loop, four iterations of the i1-loop will be run on the domain on which they overlap. The second domain is compressed by a factor of 4, since only the i2-loop is run, as illustrated below:

[Figure: iterations of the i1-loop and the i2-loop mapped onto the referential axis i; four i1 iterations run per i2 iteration on the overlapping part, and the remaining i2-only part is compressed by a factor of 4.]

2.2 Nested multifor-loops
Nested multifor-loops present some particularities, and specific semantics has to be described. Without loss of generality, let us consider two nested multifor-loops composed of two for-loop nests:

multifor ( index1 = expr, index2 = expr;
           index1 < expr, index2 < expr;
           index1 += cst, index2 += cst;
           grain1, grain2;
           offset1, offset2 ) {
  prefix : {statements}
  multifor ( index3 = expr, index4 = expr;
             index3 < expr, index4 < expr;
             index3 += cst, index4 += cst;
             grain3, grain4;
             offset3, offset4 ) {
    prefix : {statements}
  }
  prefix : {statements}
}

Such a nest behaves as two for-loop nests, (index1, index3) and (index2, index4) respectively, running simultaneously in the same way as for one unique multifor-loop. The grain of the inner multifor-loop introduces a delay for the associated for-loop, since the same reasoning as in the non-nested case is applied at each loop depth. The lower and upper bounds are affine functions of the enclosing loop indices of the same for-loop1. Let us consider some examples of nested multifor headers.

1 Notice that this restriction could be lifted for some interesting extensions.

multifor (i1 = 0, i2 = 0; i1 < 10, i2 < 5; i1++, i2++; 1, 1; 0, 2)
  multifor (j1 = 0, j2 = 0; j1 < 10, j2 < 5; j1++, j2++; 1, 1; 0, 2)

The second for-loop nest has a 2-offset at each loop depth. Hence it is delayed in each dimension of the referential domain:

[Figure: referential domain (i, j); legend: iterations (i1, j1); iterations (i1, j1) and (i2, j2).]

multifor (i1 = 0, i2 = 0; i1 < 10, i2 < 3; i1++, i2++; 1, 4; 0, 0)
  multifor (j1 = 0, j2 = 0; j1 < 10, j2 < 3; j1++, j2++; 1, 4; 0, 0)

The second for-loop nest has a 4-grain at each loop depth. Hence its iterations are spaced by 4 in each dimension of the referential domain:

[Figure: referential domain (i, j); legend: iterations (i1, j1); iterations (i1, j1) and (i2, j2).]

multifor (i1 = 0, i2 = 0; i1 < 6, i2 < 6; i1++, i2++; 1, 1; 0, 1)
  multifor (j1 = 0, j2 = 0; j1 < 6 - i1, j2 < 6; j1++, j2++; 1, 1; 0, 0)

In this example, the upper bound of the inner loop of the first loop nest is an affine function.

[Figure: referential domain (i, j); legend: iterations (i1, j1); iterations (i1, j1) and (i2, j2); iterations (i2, j2).]

2.3 Multifor-loop parallelization and code transformations

The multifor construct exhibits a straightforward parallelization strategy, which is to run, at each iteration, the loop bodies of the defined for-loops in parallel, as sketched below. This model of parallelization provides new opportunities to the polytope model, since it enables the expression of parallel programming models that were unattainable before, such as dataflow computing or fixed-depth MapReduce, as will be shown on code examples in Section 4.

Nevertheless, the multifor construct still allows OpenMP-like loop parallelization for each for-loop of the multifor, thus providing hybrid parallelization strategies.
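As a sketch of the straightforward strategy (our illustration, assuming an OpenMP-based lowering that the paper does not prescribe), the bodies of the for-loops could be run as parallel sections at every referential iteration, with a synchronization at the end of each iteration:

for (i = ref_min; i <= ref_max; i++) {
  #pragma omp parallel sections
  {
    #pragma omp section
    { if (active0(i)) body0(i); }   /* body of for-loop 0, if scheduled at i */
    #pragma omp section
    { if (active1(i)) body1(i); }   /* body of for-loop 1, if scheduled at i */
  } /* implicit barrier at the end of the sections construct */
}

Here active0/active1 and body0/body1 are hypothetical helpers encoding the grain/offset guards and the statement blocks.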


Moreover, each for-loop, or for-loop nest, of a multifor can be transformed using any well-known polyhedral transformation. However, in the context of a multifor, these transformations may be guided by the interactions between the for-loops, in order to achieve better performance or better data locality for instance. Another opportunity is the transformation of imperfect loop nests into multifor-loop nests of perfectly nested loops.

3. IMPLEMENTATION ISSUES

3.1 Reference domain
Consider one multifor-loop level. The referential for-loops cadencing the multifor execution have the constraint of scanning a sufficient number of iterations. Let us denote by f the number of for-loops. By computing the disjoint union of all for-loop iteration domains, we obtain a set of adjacent domains Di on which some of the f loops overlap. Let us denote by lbi, ubi, graini and offseti, i = 1..f, the parameters characterizing each for-loop in the multifor header. Let us set nlbi = offseti and nubi = (ubi − lbi + 1) × graini + offseti, which define the lower and upper bounds of each loop in the referential domain, since the computation of nlbi consists in translating the domain and the computation of nubi in dilating the domain by a factor equal to the grain. The disjoint union of the Di's is computed using these latter bounds. The initial index value of the referential domain is MIN_{i=1..f}(nlbi). Hence, the total number of iterations in the referential domain, before compression, is:

MAX_{i=1..f}(nubi) − MIN_{i=1..f}(nlbi) + 1

In order to generate the referential for-loops for each domain Di, the last step consists in compressing each Di by a factor defined by gcd(grainj), over all loops j overlapping on Di, as in the sketch below.
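The following C helper is a minimal sketch (ours) of these computations for one multifor-loop level, assuming inclusive upper bounds ub[i]; it only derives the translated and dilated bounds and the trip count of the referential domain before compression:

#include <limits.h>

void referential_bounds(int f, const int lb[], const int ub[],
                        const int grain[], const int offset[],
                        int *ref_min, int *ref_max, int *trip_count)
{
    int min_nlb = INT_MAX, max_nub = INT_MIN;
    for (int i = 0; i < f; i++) {
        int nlb = offset[i];                                  /* translation */
        int nub = (ub[i] - lb[i] + 1) * grain[i] + offset[i]; /* dilation    */
        if (nlb < min_nlb) min_nlb = nlb;
        if (nub > max_nub) max_nub = nub;
    }
    *ref_min = min_nlb;
    *ref_max = max_nub;
    *trip_count = max_nub - min_nlb + 1;  /* before compression */
}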

More generally, for any multifor-loop nest, the computation of the referential domain is performed in three steps. First, each iteration domain associated to one for-loop nest composing the multifor-loop nest is translated from the origin according to its offsets, and dilated according to its grains in every dimension. Notice that the values actually taken by the indices of the for-loop nests do not define their positions in the referential domain. Second, a disjoint union is computed, resulting in a union of adjacent convex domains Di. Third, each Di may be compressed according to the greatest common divisor of the grains of the associated for-loop nests, and according to the lexicographic order.

3.2 Code generation
When considering sequential code, there are two dual ways to generate the code corresponding to a multifor-loop nest. A first way is to generate loop nests scanning the referential domains through a minimal set of convex domains, and to insert guards in their bodies in order to execute the convenient instructions at each iteration. The number of these guards can be optimized by computing their common sub-domains. The second way is to scan each Di using a dedicated loop nest with a constant loop body, without guards.

Both solutions can be generated automatically using polytope model tools like PolyLib or CLooG, and by inserting phases to compute the translated and dilated domains, or to compress parts of the resulting referential domains.

If for-loops of a given depth of a multifor-loop nest have to be run in parallel, each for-loop has to be run in a separate thread.

for (i = 0; i < K; i++)
  for (j = 0; j < N; j++)
    a[i][j] = ReadImage();

for (i = 1; i < K - 1; i++)
  for (j = 1; j < N - 1; j++) {
    Sbl[i][j] = Sobel(a[i-1][j-1], a[i][j-1], a[i+1][j-1],
                      a[i-1][j],   a[i][j],   a[i+1][j],
                      a[i-1][j+1], a[i][j+1], a[i+1][j+1]);
    WriteImage(Sbl[i][j]);
  }

Figure 1: Sobel edge detection code

multifor (i1 = 0, i2 = 1; i1 < K, i2 < K - 1;
          i1++, i2++; 1, 1; 0, 3)
  multifor (j1 = 0, j2 = 1; j1 < N, j2 < N - 1;
            j1++, j2++; 1, 1; 0, 3) {
    0 : a[i1][j1] = ReadImage();
    1 : { Sbl[i2][j2] = Sobel(a[i2-1][j2-1], a[i2][j2-1],
                              a[i2+1][j2-1], a[i2-1][j2],
                              a[i2][j2],     a[i2+1][j2],
                              a[i2-1][j2+1], a[i2][j2+1],
                              a[i2+1][j2+1]);
          WriteImage(Sbl[i2][j2]); }
  }

Figure 2: Sobel edge detection multifor code

All threads have to be synchronized at the multifor-loop completion. Notice that this could be enriched by providing OpenMP-like options such as NOWAIT.

The original indices of the multifor (i1, i2, j1, ...) have to be retrieved at the beginning of each loop body, by being computed from the referential loop indices. These computations consist in subtracting offsets, in adding modulos of the referential loop indices relative to the grains, or in multiplying by grains in the case of compressed domains, as in the sketch below.
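A minimal sketch (ours) of this index recovery for one for-loop on an uncompressed referential sub-domain: a loop with lower bound lb, grain g and offset o is active at the referential iterations r = o, o + g, o + 2g, ..., and its original index is then lb + (r - o)/g.

for (r = ref_min; r <= ref_max; r++) {
    if (r >= o && (r - o) % g == 0) {     /* this for-loop is scheduled at r */
        int i1 = lb + (r - o) / g;        /* recovered original index        */
        /* ... body of the corresponding for-loop, using i1 ... */
    }
}

On compressed sub-domains, the referential index would additionally be multiplied by the compression factor before this recovery.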

4. EXAMPLES
Sobel edge detection: We first consider the code performing Sobel edge detection of an image, shown in Figure 1. The first loop nest of this program reads the input image, while the second loop nest performs the actual edge detection and writes out the output image.

Note that nine neighboring elements have to be read before a resulting pixel can be computed and written. Hence both loop nests can be naturally overlapped by writing the code using the multifor construct, exhibiting a dataflow model of computation, as shown in Figure 2. The associated referential domain is shown in Figure 3.

Red-Black Gauss-Seidel: The second example is the Red-Black Gauss-Seidel algorithm, composed of two phases. The first phase consists in updating the red elements of a grid, which are one point over two in the i and j directions of the grid, starting from the bottom left corner, using their North-South-East-West (NSEW) neighbors, which are black elements. The second phase consists in updating the black elements from the red ones. For a 2D N×N problem, the usual code is of the form shown in Figure 4 (the initialization of the border elements has been omitted).

On the iteration domain, a different lattice of iterations is active at each phase.


[Figure: referential domain (i, j); legend: iterations (i1, j1); iterations (i1, j1) and (i2, j2); iterations (i2, j2).]

Figure 3: Sobel edge detection referential domain

// Red phase
for (i = 1; i < N - 1; i++)
  for (j = 1; j < N - 1; j++)
    if ((i + j) % 2 == 0)
      u[i][j] = f(u[i][j+1], u[i][j-1], u[i-1][j], u[i+1][j]);

// Black phase
for (i = 1; i < N - 1; i++)
  for (j = 1; j < N - 1; j++)
    if ((i + j) % 2 == 1)
      u[i][j] = f(u[i][j+1], u[i][j-1], u[i-1][j], u[i+1][j]);

Figure 4: Red-Black Gauss-Seidel code

Moreover, the NSEW dependencies prevent any linear parallel schedule. This code can be translated into a multifor-loop nest where the red and black phases each yield two for-loop nests with convenient grains and offsets, as shown in Figure 5.

The initial referential domain of the multifor code is represented in Figure 6 on the left, also showing the transformed dependency vectors, which now allow a linear schedule. On the right of Figure 6, the final iteration domains, after compression, are represented.

This example shows that the multifor construct allows one to exploit a different kind of parallelism (the so-called wavefront parallelism in this case), since the traditional parallelization consists in parallelizing each of the phases and executing the phases one after the other. The multifor strategy can be preferable to improve data locality and thus the resulting execution time.

multifor (i0 = 1, i1 = 2, i2 = 1, i3 = 2; i0 < N - 1, i1 < N - 1,
          i2 < N - 1, i3 < N - 1; i0 += 2, i1 += 2, i2 += 2,
          i3 += 2; 2, 2, 2, 2; 0, 1, 0, 1)
  multifor (j0 = 1, j1 = 2, j2 = 2, j3 = 1; j0 < N - 1, j1 < N - 1,
            j2 < N - 1, j3 < N - 1; j0 += 2, j1 += 2, j2 += 2,
            j3 += 2; 2, 2, 2, 2; 0, 1, 1, 0) {
    0 : u[i0][j0] = f(u[i0][j0+1], u[i0][j0-1], u[i0-1][j0], u[i0+1][j0]);
    1 : u[i1][j1] = f(u[i1][j1+1], u[i1][j1-1], u[i1-1][j1], u[i1+1][j1]);
    2 : u[i2][j2] = f(u[i2][j2+1], u[i2][j2-1], u[i2-1][j2], u[i2+1][j2]);
    3 : u[i3][j3] = f(u[i3][j3+1], u[i3][j3-1], u[i3-1][j3], u[i3+1][j3]);
  }

Figure 5: Red-Black Gauss-Seidel multifor code

[Figure: two views of the referential domain (i, j); legend: iterations (i0, j0) (red phase); iterations (i1, j1) (red phase); iterations (i2, j2) (black phase); iterations (i3, j3) (black phase); holes.]

Figure 6: Red-Black Gauss-Seidel referential domain

for (j = 1; j < N - 1; j += 2)
  u[i][j] = f(u[i][j+1], u[i][j-1], u[i-1][j], u[i+1][j]);
for (i = 2; i < N - 2; i += 2) {
  for (j = 2; j < N - 1; j += 2) {
    u[i][j]   = f(u[i][j+1], u[i][j-1], u[i-1][j], u[i+1][j]);
    u[i][j+1] = f(u[i][j+2], u[i][j], u[i-1][j+1], u[i+1][j+1]);
  }
  for (j = 1; j < N - 1; j += 2) {
    u[i+1][j]   = f(u[i+1][j+1], u[i+1][j-1], u[i][j], u[i+2][j]);
    u[i+1][j+1] = f(u[i+1][j+2], u[i+1][j], u[i][j+1], u[i+2][j+1]);
  }
}
for (j = 2; j < N - 1; j += 2)
  u[N-2][j] = f(u[N-2][j+1], u[N-2][j-1], u[N-3][j], u[N-1][j]);

Figure 7: Red-Black Gauss-Seidel generated code

Here, some parallelism within the red points and within the black points can still be exploited, but also parallelism between red and black points. As an example, we show in Figure 7 the sequential code that could be generated from this multifor-loop nest.

Notice also that, more generally, when dependencies allow it, such a decomposition of a loop-nest computation into separate lattices, expressed as a multifor-loop nest, provides another parallelization strategy that may often be quite interesting due to data locality issues.

Matrix product by blocks: The third example is an algorithm to compute the product of two n × n matrices (A × B = C) by partitioning the matrices into uniform blocks. The matrix product is then carried out block by block. We split the two matrices A and B as follows:

• matrix A is divided into two matrices A1 and A2 whose dimension is n/2 × n.

• matrix B is divided into two matrices B1 and B2 whose dimension is n × n/2.

The product A × B = C translates to four products: A1 ∗ B1 = C1, A1 ∗ B2 = C2, A2 ∗ B1 = C3 and A2 ∗ B2 = C4, i.e.,

\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} C_1 & C_2 \\ C_3 & C_4 \end{pmatrix}


multifor (i1 = 0, i2 = 0, i3 = n/2, i4 = n/2;
          i1 < n/2, i2 < n/2, i3 < n, i4 < n;
          i1++, i2++, i3++, i4++;
          1, 1, 1, 1; 0, 0, 0, 0) {
  multifor (j1 = 0, j2 = n/2, j3 = 0, j4 = n/2;
            j1 < n/2, j2 < n, j3 < n/2, j4 < n;
            j1++, j2++, j3++, j4++;
            1, 1, 1, 1; 0, 0, 0, 0) {
    0 : c[i1][j1] = 0;
    1 : c[i2][j2] = 0;
    2 : c[i3][j3] = 0;
    3 : c[i4][j4] = 0;
    multifor (k1 = 0, k2 = 0, k3 = 0, k4 = 0;
              k1 < n, k2 < n, k3 < n, k4 < n;
              k1++, k2++, k3++, k4++;
              1, 1, 1, 1; 0, 0, 0, 0) {
      0 : c[i1][j1] = c[i1][j1] + a[i1][k1] * b[k1][j1];
      1 : c[i2][j2] = c[i2][j2] + a[i2][k2] * b[k2][j2];
      2 : c[i3][j3] = c[i3][j3] + a[i3][k3] * b[k3][j3];
      3 : c[i4][j4] = c[i4][j4] + a[i4][k4] * b[k4][j4];
    }
  }
}

Figure 8: Multifor matrix product code

The dimension of matrices C1, C2, C3 and C4 is (n/2 × n/2). These four products might be performed simultaneously and can be naturally expressed using the multifor structure, as shown in Figure 8.

Geometrically, the multifor iteration domain is an (n/2 × n/2 × n) rectangular parallelepiped where each point is associated to four iterations of the four included loop nests. This execution scheme corresponds to the MapReduce strategy, since each for-loop nest computes an n/2 × n/2 sub-block (map step), and the combination of all sub-blocks forms the resulting matrix C (reduce step).

Steganography: The fourth example is the decoding phase of a steganography code where a hidden image is extracted from an enclosing one. It is assumed that the upper left pixel of the hidden image is hidden within the upper left pixel of the enclosing image; HWidth and HHeight are the width and the height of the hidden image; EWidth and EHeight are the width and the height of the enclosing image; EImage is the image hiding another image; HImage is the extracted output image that was hidden; MImage is the output enclosing image, no longer hiding the image that was hidden in EImage. The proposed multifor code version is composed of four simultaneous for-loop nests, the first being dedicated to the extraction of the hidden image, the second to the extraction of the part of the enclosing image which hides the hidden image, the third and the fourth being dedicated to copying the pixels directly to the retrieved enclosing image. Since the union of the third and fourth domains is not convex, two loop nests are necessary to scan it. Notice that we introduce a shortcut in the syntax such that similar loop bodies can be instantiated differently depending on their associated loop nest. The multifor code is shown in Figure 9 and the referential iteration domain in Figure 10. Notice that full parallelism is exhibited with this code.

Secret key cryptosystem: The fifth example is a classic secret key cryptosystem that manipulates binary words.

RGBApixel decode_hidden(i, j) {
  RGBApixel Pixel1 = *EImage(i, j);
  RGBApixel Pixel2;
  Pixel2.Red = Pixel1.Red % 2;
  Pixel2.Green = Pixel1.Green % 2;
  Pixel2.Blue = Pixel1.Blue % 2;
  Pixel2.Alpha = Pixel1.Alpha % 2;
  return Pixel2;
}

RGBApixel decode_main(i, j) {
  RGBApixel Pixel1 = *EImage(i, j);
  RGBApixel Pixel2;
  Pixel2.Red = Pixel1.Red - Pixel1.Red % 2;
  Pixel2.Green = Pixel1.Green - Pixel1.Green % 2;
  Pixel2.Blue = Pixel1.Blue - Pixel1.Blue % 2;
  Pixel2.Alpha = Pixel1.Alpha - Pixel1.Alpha % 2;
  return Pixel2;
}

multifor (i1 = 0, i2 = 0, i3 = 0, i4 = HWidth; i1 < HWidth,
          i2 < HWidth, i3 < HWidth, i4 < EWidth;
          i1++, i2++, i3++, i4++; 1, 1, 1, 1; 0, 0, 0, 0)
  multifor (j1 = 0, j2 = 0, j3 = HHeight, j4 = 0; j1 < HHeight,
            j2 < HHeight, j3 < EHeight, j4 < EHeight;
            j1++, j2++, j3++, j4++; 1, 1, 1, 1; 0, 0, 0, 0)
  {
    0 : // Retrieve the hidden image
        *HImage(i1, j1) = decode_hidden(i1, j1);
    1 : // Retrieve the enclosing image
        *MImage(i2, j2) = decode_main(i2, j2);
    [2, 3] : // Retrieve the enclosing image
        *MImage([i3, i4], [j3, j4]) = *EImage([i3, i4], [j3, j4]);
  }

Figure 9: Multifor steganography code for the decoding phase

It proceeds by splitting a message m into blocks of constant size. These cryptosystems are characterized by the length of each block, the operating mode and the encryption system of each block. Each cipher mode comprises:

1. Cutting the plain text message m into blocks m1, ..., mk;

2. Encrypting the blocks mi, resulting in the encrypted blocks c1, ..., ck;

3. Concatenating the blocks c1, ..., ck to construct the encrypted message c.

Each block is encrypted through the product of two cryptosystems T1 and T2. It is classically computed using a loop of the form:

for (i = 0; i < k; i++) {
  c[i] = Encrypt(m[i], T1);
  c[i] = Encrypt(c[i], T2);
}

Suppose the encryption of each block by a given cryptosystem consumes one unit of time. The time required to encrypt the entire message using this loop is 2 × k. Let us write this code using a multifor structure:


[Figure: referential domain of size EWidth × EHeight, with the HWidth × HHeight sub-domain marked; legend: iterations (i1, j1) and (i2, j2); iterations (i3, j3); iterations (i4, j4).]

Figure 10: Referential domain for the multifor steganography code

multifor (i1 = 0, i2 = 0; i1 < k, i2 < k;
          i1++, i2++; 1, 1; 0, 1) {
  0 : c[i1] = Encrypt(m[i1], T1);
  1 : c[i2] = Encrypt(c[i2], T2);
}

This form allows us to take advantage of a pipeline scheme where two successive blocks are encrypted in parallel by the cryptosystems T1 and T2, respectively. Thus, the time required to encrypt the message is k + 1, as illustrated by the sketch below.
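The pipeline can be made explicit by expanding the multifor over its referential index (our sketch of the sequential semantics; in the parallel interpretation the two guarded statements of one iteration run concurrently):

for (i = 0; i <= k; i++) {            /* referential loop, k + 1 iterations */
    if (i < k)  c[i]     = Encrypt(m[i],     T1);   /* for-loop 0, offset 0 */
    if (i >= 1) c[i - 1] = Encrypt(c[i - 1], T2);   /* for-loop 1, offset 1 */
}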

5. A PROMISING PERSPECTIVE: NON-LINEAR MAPPING

Among the numerous possible extensions, an important one is to map iteration spaces together in a non-linear fashion, such that their shapes have no influence on the mapping. In general, this would allow executing the iterations of any loop nest by any other one of the same trip count, and thus significantly enlarge the ways iterations of different loops can be mapped together. Hence a multifor construct could express quite different computations in a concise way, and augment the number of optimization and parallelization opportunities. Notice that loop nests with different trip counts can also be handled by splitting the largest nest such that one of the resulting nests has the convenient trip count.

Our idea is based on ranking Ehrhart polynomials. It has been shown in previous works dealing with spatial data locality optimization that it is possible to compute an Ehrhart polynomial associated to a loop nest giving the rank of an iteration [5, 9]. These polynomials have specific properties: they are necessarily monotonically increasing according to the lexicographic order of the loop indices, and they define a bijection between the iteration points and the interval of strictly positive integers between one and the total iteration count of the loop nest.

Since ranking Ehrhart polynomials define such bijections, the ability to invert them would provide a way to retrieve the loop indices corresponding to the rank of an iteration. Hence at any iteration of a loop nest, it would be possible to compute, from the rank, the values of the loop indices that would have been reached while running another nest. Thus, any nest could be run by another one, and any iteration space could be mapped onto another one, following the rank of the iterations. Hence this section deals with the problem of inverting ranking Ehrhart polynomials.

Before proposing a general resolution, we first present a 2-dimensional example.

5.1 2-dimensional example
Consider the two loop nests in listings (1) and (2), where instructionsk(i1, i2) denotes the instruction block computed at iteration (i1, i2).

for (i = 1; i < N; i++)
  for (j = 0; j < i; j++)
    instructions1(i, j);                  (1)

for (i′ = 0; i′ < M1; i′++)
  for (j′ = 0; j′ < M2; j′++)
    instructions2(i′, j′);                (2)

These loops have as respective ranking Ehrhart polynomials:

P1(i, j) = i(i − 1)/2 + j   and   P2(i′, j′) = i′ M2 + j′

Assuming that both nests have the same iteration count and that there is no dependency between instructions1 and instructions2, we could merge the two former loop nests and write (3).

for (i′ = 0; i′ < M1; i′++)
  for (j′ = 0; j′ < M2; j′++) {
    instructions1(i, j);
    instructions2(i′, j′);                (3)
  }

However, we need to express the indices (i, j) as a function of (i′, j′) in order to preserve the execution order of the block instructions1. More precisely, for each iteration number K in loop nest (3), we want to execute the Kth iteration of loop nest (1). This is why we must invert the ranking Ehrhart polynomial P1, to compute (i, j) = P1^{-1}(P2(i′, j′)).

For any rank K, we have to find a couple of indices (i0, j0) such that P1(i0, j0) = K. The main idea is to cut the 2-dimensional problem into two one-dimensional problems.

Let us define Q1(i) = P1(i, 0) = i(i − 1)/2. As ranking Ehrhart polynomials are (strictly) increasing, the following relation holds:

Q1(i0) = P1(i0, 0) ≤ P1(i0, j0) = K ≤ P1(i0 + 1, 0) = Q1(i0 + 1)

And, for the same reason, we know that index i0 is unique on N+. Let us now consider the polynomial Q1 as a polynomial over R. By continuity of Q1 over R, there exists α ∈ [0, 1[ such that Q1(i0 + α) = K. This shows that the equation Q1(x) = K has at least one real solution. So we have to find x such that:

Q1(x) = K ⇔ Q1(x) − K = 0 ⇔ x(x − 1)/2 − K = 0

Obviously, this last equation has two real roots:

x1 = 1/2 − √((1 + 8K)/4),   x2 = 1/2 + √((1 + 8K)/4)

To select the convenient root, we notice that for all K > 0, x1 ≤ 0. As i0 ≥ 1 (according to the loop bounds in (1)), i0 + α = x2, and thus i0 = ⌊x2⌋. We can now replace i0 by its value in P1(i0, j0):

P1(i0, j0) = 1/2 ⌊1/2 + √((1 + 8K)/4)⌋ (⌊1/2 + √((1 + 8K)/4)⌋ − 1) + j0

and finally deduce j0:

j0 = K − 1/2 ⌊1/2 + √((1 + 8K)/4)⌋ (⌊1/2 + √((1 + 8K)/4)⌋ − 1)
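As a quick numerical check (ours, consistent with the formulas above): for K = 7, x2 = 1/2 + √(57/4) ≈ 4.27, hence i0 = ⌊x2⌋ = 4 and j0 = 7 − (4 · 3)/2 = 1; indeed P1(4, 1) = 6 + 1 = 7.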

The resulting code is shown in listing 4.


K = 0;
for (i′ = 0; i′ < M1; i′++)
  for (j′ = 0; j′ < M2; j′++) {
    K++;
    i = floor(sqrt((1 + 8 * K) / 4) + 1/2);
    j = K − i * (i − 1) / 2;
    instructions1(i, j);
    instructions2(i′, j′);                (4)
  }

We now present the general case, which can be easily deduced from the 2-dimensional case.

5.2 General case
Without any loss of generality, we assume that the lower bounds of all loop indices equal 0. We consider the N-dimensional ranking Ehrhart polynomial P(i, j, k, ...), and for each K, we seek the tuple (i0, j0, k0, ...) such that P(i0, j0, k0, ...) = K.

Similarly to the previous example, we start with the outermost loop index i0. We define Qi(i) = P(i, 0, 0, ..., 0) and solve Qi(x) − K = 0. Here is the only issue that differs from the 2D case: as we cannot state that Qi is monotonically increasing on R+, we have to find a criterion to select the root giving the sought index.

First, we obviously eliminate complex and negative solutions, since i0 ∈ N+, and consider the n ≤ N positive real roots {x1, ..., xn}. As we know that i0 is unique, a way to select the convenient root is to check which x ∈ {x1, ..., xn} satisfies:

Qi(⌊x⌋ = i0) ≤ Qi(x) ≤ Qi(⌈x⌉ = i0 + 1)

However, this strategy is only applicable at runtime, and may add a non-negligible time overhead. A compile-time solution is to check whether Qi is monotonically increasing on R+ by examining its derivative. If so, any root in {x1, ..., xn} is suitable.
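For the 2-dimensional example, this compile-time criterion is immediate (our remark): Q1'(x) = x − 1/2 > 0 for all x ≥ 1, so Q1 is monotonically increasing on the relevant range, and the only non-negative root x2 can be selected directly.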

Once i0 has been found, the process starts again with Qj(j) = P(i0, j, 0, ..., 0), and so on until all indices have been computed.

6. CONCLUSION
We have proposed a new programming control structure called "multifor" and showed that it allows the polytope model to handle programming models that were not directly attainable before. Important related theoretical studies still have to be conducted, such as dependency analysis between the for-loops composing a multifor-loop, or optimizing code transformations that consider the interactions between the for-loops.

Many interesting extensions can also be studied, such as making header parameters, or instructions, dependent on several for-loop indices composing the same multifor-loop level, defining non-invariant grains and offsets, or introducing conditionals on the effective run of the for-loops, etc. In this paper, we showed that it may be possible to handle non-linear mappings of iteration spaces using inverted ranking Ehrhart polynomials.

The multifor structure can also be used as a representation model for some interacting mechanisms, such as concurrent memory accesses, as is done for sequential codes in [7].

We are planning to implement multifor structures in the Clang/LLVM compiler as an extension to C/C++.

7. REFERENCES
[1] U. Banerjee. Loop Transformations for Restructuring Compilers - The Foundations. Kluwer Academic Publishers, 1993. ISBN 0-7923-9318-X.
[2] OpenMP Architecture Review Board. OpenMP application program interface, version 3.1, 2011.
[3] C. Cascaval, C. Blundell, M. Michael, H. W. Cain, P. Wu, S. Chiras, and S. Chatterjee. Software transactional memory: Why is it only a research toy? Queue, 6(5):46–58, Sept. 2008.
[4] I. Christadler, G. Erbacci, and A. D. Simpson. Facing the multicore-challenge II, chapter Performance and productivity of new programming languages, pages 24–35. Springer-Verlag, Berlin, Heidelberg, 2012.
[5] P. Clauss and B. Meister. Automatic memory layout transformations to optimize spatial locality in parameterized loop nests. SIGARCH Comput. Archit. News, 28(1):11–19, Mar. 2000.
[6] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer, III. Software transactional memory for dynamic-sized data structures. In Proc. of the 22nd annual symp. on Principles of distributed computing, PODC '03, pages 92–101. ACM, 2003.
[7] A. Ketterlin and P. Clauss. Prediction and trace compression of data access addresses through nested loop recognition. In 6th annual IEEE/ACM int. symp. on Code generation and optimization, pages 94–103, Boston, United States, Apr. 2008. ACM.
[8] C. E. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, DAC '09, pages 522–527, New York, NY, USA, 2009. ACM.
[9] V. Loechner, B. Meister, and P. Clauss. Precise data locality optimization of nested loops. J. Supercomput., 21(1):37–76, Jan. 2002.
[10] S. Marlow, P. Maier, H.-W. Loidl, M. K. Aswad, and P. Trinder. Seq no more: Better strategies for parallel Haskell. In Proceedings of the 3rd ACM SIGPLAN symposium on Haskell, pages 91–102, Baltimore, MD, United States, Sept. 2010. ACM Press.
[11] M. Odersky, L. Spoon, and B. Venners. Programming in Scala. Artima Series. Artima Press, 2011.
[12] L.-N. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam, P. Sadayappan, and N. Vasilache. Loop transformations: convexity, pruning and optimization. In Proc. of the 38th annual ACM SIGPLAN-SIGACT symp. on Principles of programming languages, POPL '11, pages 549–562, New York, NY, USA, 2011. ACM.
[13] K. F. Sagonas. Using static analysis to detect type errors and concurrency defects in Erlang programs. In FLOPS, pages 13–18, 2010.
[14] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. C. Minh, and B. Hertzberg. McRT-STM: a high performance software transactional memory system for a multi-core runtime. In Proc. of the 11th ACM SIGPLAN symp. on Principles and practice of parallel programming, PPoPP '06, pages 187–197, New York, NY, USA, 2006. ACM.
[15] N. Shavit and D. Touitou. Software transactional memory. In Proc. of the 14th annual ACM symp. on Principles of distributed computing, PODC '95, pages 204–213, New York, NY, USA, 1995. ACM.
[16] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.


Facilitate SIMD-Code-Generation in the Polyhedral Model by Hardware-aware Automatic Code-Transformation

Dustin Feld and Thomas Soddemann
Fraunhofer SCAI, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
[dustin.feld|thomas.soddemann]@scai.fraunhofer.de

Michael Jünger and Sven Mallach
Institut für Informatik, Universität zu Köln, Weyertal 121, 50931 Köln, Germany
[mjuenger|mallach]@informatik.uni-koeln.de

ABSTRACT
Although Single Instruction Multiple Data (SIMD) units have been available in general purpose processors since the 1990s, state-of-the-art compilers are often still not capable of fully exploiting them, i.e., they may miss the best possible performance.

We present a new hardware-aware and adaptive loop tiling approach that is based on polyhedral transformations and explicitly dedicated to improving auto-vectorization. It is an extension to the tiling algorithm implemented within the PluTo framework [4, 5]. In its default setting, PluTo uses static tile sizes and is already capable of enabling the use of SIMD units, but it is not primarily targeted at optimizing it. We experimented with different tile sizes and found a strong relationship between their choice, cache size parameters and performance. Based on this, we designed an adaptive procedure that specifically tiles vectorizable loops with dynamically calculated sizes. The blocking is automatically fitted to the amount of data read in loop iterations, the available SIMD units and the cache sizes. The adaptive parts are built upon straightforward calculations that are experimentally verified and evaluated. Our results show significant improvements in the number of vectorized instructions, cache miss rates and, finally, running times.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—Code generation, Compilers, Optimization; B.3.2 [Memory Architectures]: Design Styles—Cache Memories; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)—single-instruction-stream, multiple-data-stream processors (SIMD)

General Terms
Algorithms, Performance, Experimentation

Keywords
SIMD, SSE, AVX, TSS, polyhedral model, polyhedron model, tiling, code generation, automatic parallelization, vectorization, loop optimization, loop transformation

1. INTRODUCTION
Single Instruction Multiple Data (SIMD) units offer a speedup potential to a wide range of users. In order to exploit the available concurrency of modern CPUs, one must achieve shared memory parallelism by multithreading and, at the same time, vectorization by effectively applying instructions to multiple data. Both tasks impose a difficult challenge on experienced programmers as well as state-of-the-art compilers. In this paper, we focus on the effective automatic exploitation of SIMD units. We found severe limitations when we made some experiments with the vectorizer of the GNU C Compiler (gcc, version 4.6). To our surprise, even for loops that can be vectorized in a straightforward manner, SSE instructions were only emitted if the range specified by the loop bounds was a multiple of the SIMD register width divided by the size of the data type.

Vectorization is difficult since it usually requires an analysis of the dependence structure of the code to be optimized. It demands the right ordering of instructions and fast accesses to data in order to leverage its full speedup potential. Unfortunately, in many cases even the existence of a legal order of instructions cannot be easily recognized by humans or compiler procedures. Here, polyhedral code optimization is a powerful tool to detect and exploit parallelism in loop structures. It provides a formal characterization of affine loop nests, their iteration spaces, the dependencies between statements and the iteration points in which they occur [2, 7, 9, 12, 13]. It can therefore be used to generate valid transformations of a given source code. Further, tiling (or blocking) of nested loops is a well-known technique to improve data locality and, if concurrency concerning outer loop dimensions is possible, to perform automatic parallelization [21]. In this manner, transformations such as loop fusion, splitting, skewing, or interchange may enable coarse-grain (tile-wise) or fine-grain (loop-internal) concurrency (or both) even where this is not the case for the original source code [2].

There is extensive literature dealing with the optimization of tilings. However, to the best of our knowledge, there is as yet no implemented approach that integrates loop transformations which broadly enable automatic vectorization with tilings and an adaptive hardware-aware tile size selection (TSS). Trifunovic et al. [20] analyze the impact of loop transformations on the resulting possibilities to apply auto-vectorization and on performance by means of a cost model that can be seamlessly integrated into the polyhedral model. Based on this, they propose a framework to choose the best-suited loop for vectorization within a nest. Unfortunately, it does not comprise a TSS model. As opposed to that, there exist several TSS models which are, however, not explicitly geared towards improved vectorization. Coleman and McKinley [8] present an iterative technique to calculate cache-fitting tile sizes for all loop dimensions. This is interesting in conjunction with the approach presented


in this paper, especially for cases where there is no vectorizable loop available. This is also true for Sarkar's and Megiddo's [17] approach to generate tile sizes via a memory-oriented cost model of the given loop nest and an analytical model to finally perform the TSS. However, it is restricted to loop nests of depth two or three. Shirako et al. [18] present a method to analytically bound the search space for tile sizes that lead to good performance. They potentially leave loop dimensions unblocked and propose to tile a vectorizable loop into large blocks. While these general ideas are similar to ours, their approach is merely capable of performing a one-level tiling and is based on an empirical search method, while ours uses a straightforward polyhedral and analytical basis. Ghosh et al. [11] propose to use cache miss equations [3] for TSS and in order to detect poor cache performance. They show how loop transformations (including tiling) can help to improve on this. Abella et al. [1] use the equations to determine optimal tile sizes by means of a genetic algorithm. However, its running time on some inputs is not suitable for a common compilation process.

In this paper, we present a new hardware-oriented TSS approach explicitly targeted at vectorization. It is an adaptive procedure that specifically tiles vectorizable loops with dynamically calculated sizes. We implemented our approach as an extension to PluTo [4, 5], which is an academic source-to-source compiler framework that performs tilings based on polyhedral optimization. The resulting code can be compiled by any C compiler. In particular, we use PluTo to obtain valid transformations and tilings of loops as well as to gather information about those loops that can actually be vectorized. However, in contrast to PluTo, we orient the tiling towards an effective use of SIMD units. Instead of partitioning the iteration space with respect to all loop dimensions and into tiles of static size, we restrict the tiling to those loops that are relevant for the data to be processed in a vectorized manner. Further, we dynamically adapt the size of the tiles to the SIMD register width and the cache sizes of the underlying hardware. Ideally, our approach leads to a software pipeline of blocks to be processed by the SIMD units. We show for two example source codes that we obtain improved running times by a combination of a measurably well-performing stream of data through the cache hierarchy and an increased rate of issued SIMD instructions.

2. POLYHEDRAL TRANSFORMATIONS, VECTORIZABLE LOOPS AND TILINGS

A loop qualifies for vectorization if it is innermost with respect to its nest and parallelizable, i.e., there are no data dependencies between its iterations.

In the polyhedral model, one considers the iteration points of a loop nest and the dependencies between them using Z-polyhedra [9, 12, 13], as depicted in Fig. 1. In this representation, the index variable of each loop relates to one dimension of the associated polyhedron. A valid transformation corresponds to a change in the order of execution of the iteration points that preserves compliance with the dependencies. This may include the manipulation of loop dimensions (index variables), leading to a deformation of the polyhedron such that computations can be processed in parallel with respect to one or more of the dimensions. By applying integer programming techniques, PluTo is capable of performing transformations such that the necessary communication across the dimensions of the resulting loops is minimized [6]. This is beneficial for parallelization. If communication is not necessary with respect to an outer loop dimension, then a parallelization of the inner loops' iterations is possible. If communication is not necessary with respect to the innermost loop dimension, then the iterations of this loop can be processed in parallel, e.g., by vectorization. Fortunately, PluTo is able to mark loops that qualify for vectorization, possibly after interchanging them to become the innermost ones (which we assume from now on to be the case). Furthermore, PluTo can compute a partitioning of the iteration space into tiles. Consider Fig. 1 for an example where a legal (rectangular) tiling is possible only after a transformation of the loop nest. The left tiling is illegal, because iteration points between blocks have reciprocal dependencies. After manipulating the loop dimensions, it is possible to apply the tiling depicted in the right image, since it now allows for an order of execution that respects all dependencies. We use the polyhedral representations within PluTo to obtain such legal loop tilings.

[Figure: two Z-polyhedra with axes (i, j) and (i′, j′), showing dependence vectors and a rectangular tiling of each.]

Figure 1: A polyhedral representation of a nested loop with its dependence structure and an invalid tiling (left) as well as a valid one (right).

3. CENTRAL IDEASIn order to fully exploit the speedup potential of SIMD

units, it is advantageous to have access to data that is ac-cordingly prefetched into the available cache hierarchy. Howcan we achieve this by a smart tiling?

As already stated, a loop to be vectorized must be (made)innermost. Its index variable imposes a constant offset/stridebetween the memory addresses of data that is to be succes-sively packed into (SIMD) registers while the indices of allother loops remain constant. It can thus be expected that itis beneficial for the successful prefetching of operands, if theinnermost loop is executed for a relatively large number of it-erations before the control flow leaves it and manipulates theother loops’ index variables. Then, the prefetcher should beable to better pre-load future operands while current calcu-lations are served from the cache hierarchy. The prefetchingcauses cache misses that do not significantly influence therunning time but impose cache hits when the operands arereally needed. On the other hand, a large number of itera-tions in the innermost loop may also harm spatial localityin comparison to fewer ones. This is particularly true if theindex variable of the innermost loop does not correspond tothe minor dimension of all accessed arrays (and thereforenot to one-strides in memory).

Clearly, there must be a trade-off between the advantages and disadvantages of a large blocking of the innermost loop. However, we believe that blocking all loops into constantly sized blocks (like, e.g., the default tiling into blocks of 32 iterations performed by PluTo) is unlikely to perform well for every application and every system. This is especially true for deeply nested loops with many statements within the innermost loop. As an example, if d is the depth of a loop nest, then the innermost statements of a tile using PluTo's default tiling are executed 32^d times, and the data accessed by these statements is unlikely to fit into a cache as d increases. Nonetheless, this fits quite well for 'typical' loop nests of depth two or three and today's usual cache sizes. PluTo can additionally enforce a second-level tiling of the resulting loops into blocks of 8, which already addresses the cache memory hierarchies of modern processors. Alternatively, the user may specify tile sizes manually. We used this fact to elaborate on our ideas and made experiments with different sizes for a specific tiling of one or two loops only.
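To make the growth concrete (our arithmetic, not the paper's): a statement that loads only one new 4-byte element per innermost iteration already streams

    32^3 * 4 B = 131072 B = 128 KB

through a single tile at depth d = 3, four times the 32 KB L1 data cache of the test system described below.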

3.1 Experiments with manual tile sizes

We consider two test cases, both of which are written in C and have been taken from PluTo's example suite:

• A standard matrix multiplication (see Fig. 18)

• A correlation matrix algorithm (see Fig. 21)

The environment for the tests and the benchmarks in the subsequent section comprises the following Intel Xeon processor and is running Scientific Linux 6.

CPU: Intel® Xeon® CPU X5650 (2.67 GHz)

L1 / L2 / L3 cache: 32 KB (data) / 256 KB / 12288 KB

SSE version: 4.2 (128 bit registers)

All runs were performed single-threaded using gcc 4.6 or icc 13.0, both with optimization level -O3, on single precision floating point data.

3.1.1 Manual one-level tiling

To start, we consider a pure tiling of the vectorized loop only, with manually set tile sizes qL1 = 4, 64 and 256. We compare it to the original source code and to PluTo, configured to also generate a one-level tiling of all loops using its default sizes (see Fig. 23).

First, we evaluate the resulting matrix multiplication codes for various choices of (symmetric) matrix sizes N = M = K within a small range for a fine-grain analysis.

[Plot omitted: running time (s) over N = 995..1005, full view and close-up, for source, PluTo (def.), qL1 = 4, qL1 = 64 and qL1 = 256; annotations mark measurements that are not vectorized by gcc.]
Figure 2: Running times for the one-level-tiled matrix multiplication (fine grain) with gcc

As depicted in Fig. 2, the performance of the original (source) version highly depends on the size of the matrix. For sizes that are multiples of four, the code generated by gcc performs about four times faster. Application of gcc's verbose mode confirmed the hypothesis that it is only able to apply auto-vectorization in these cases. Likewise, already a very fine-grain tiling with qL1 = 4 suffices to enable auto-vectorization by gcc for any matrix size. Increasing qL1 to 256 leads to running times that are comparable to those achieved by the default tiling of PluTo, or even faster. The deviant behavior for sizes that are not multiples of four can be explained by the 'remainders' that result from the tiling in these cases. We experimentally determined a perfect correlation between the size of the last tiles and the rate of vectorization, i.e., again gcc appears not to vectorize these remaining loop iterations, especially if their number is odd. We further elaborate on different parameters that influence the sustained performance in Sect. 4.2.

In Table 1, the relative performance from the fine-grained setting is confirmed for a larger interval of register-fitting (multiples of four) and non-register-fitting matrix sizes.

N              256       512       1024      1536      2048
source         0.0096    0.0905    2.3191    7.9956    20.2370
PluTo (def.)   0.0063    0.0536    0.4645    1.4801    3.7902
qL1 = 4        0.0202    0.1829    2.1804    8.2643    26.1112
qL1 = 64       0.0071    0.0672    0.6092    2.4307    10.0529
qL1 = 256      0.0048    0.0454    0.3887    1.6499    5.2734

N              257       513       1025      1537      2049
source         0.0223    0.1956    9.9101    34.0321   82.7667
PluTo (def.)   0.0121    0.1076    0.8839    2.9326    7.5070
qL1 = 4        0.0444    0.3702    3.6207    14.8025   38.7883
qL1 = 64       0.0113    0.1081    0.9672    3.7944    14.0045
qL1 = 256      0.0079    0.0673    0.5736    2.2688    6.4815

Table 1: Running times (s) for the one-level-tiled matrix multiplication (coarse grain) with gcc

We apply the same test cases to the correlation matrix algorithm, which has a more complicated code structure. As can be seen in Fig. 3 and Table 2, gcc is not able to vectorize its original version at all. However, again tiling the loops corresponding to the M matrix dimension (together with the corresponding code transformation) enables gcc to auto-vectorize the code already when setting qL1 to 4. With qL1 = 256, the running times are almost always faster than with PluTo's default one-level tiling.

[Plot omitted: running time (s) over N = 995..1005, full view and close-up, for source, PluTo (def.), qL1 = 4, qL1 = 64 and qL1 = 256; the source version is not vectorized by gcc.]
Figure 3: Running times for the one-level-tiled correlation matrix algorithm (fine grain) with gcc

3.1.2 Manual two-level tiling

Now, we consider a two-level tiling (of the vectorized loop and additionally the outermost loop of its nest), again using manually set tile sizes qL1 = 4, 64 and 256 for the vectorized loop and qL2 = 2, 4 and 8 for the outermost one.


N              256       512       1024      1536      2048
source         0.0114    0.2140    4.8512    22.4826   60.9249
PluTo (def.)   0.0078    0.0599    0.4844    1.5089    3.8053
qL1 = 4        0.0221    0.1844    2.0942    8.5166    21.8286
qL1 = 64       0.0065    0.0573    0.5204    1.8351    5.7119
qL1 = 256      0.0054    0.0404    0.3472    1.2619    3.7619

N              257       513       1025      1537      2049
source         0.0115    0.2152    4.9040    22.4878   61.0156
PluTo (def.)   0.0070    0.0533    0.4367    1.3892    3.3829
qL1 = 4        0.0209    0.1745    1.8981    7.5941    20.9514
qL1 = 64       0.0065    0.0557    0.5031    1.7586    5.5304
qL1 = 256      0.0054    0.0403    0.3430    1.2401    3.7118

Table 2: Running times (s) for the one-level-tiled correlation matrix algorithm (coarse grain) with gcc

We configured PluTo to also generate a two-level tiling using its default sizes (see Fig. 23). The running times are depicted in Figures 4 and 5 and, for ease of comparison, the one-level tiling running times are shown dotted in the right graphs. As might have been expected, the additional blocking leads to shorter running times in all cases.

[Plot omitted: running time (s) over N = 995..1005, full view and close-up, for source, PluTo (def.), qL1 = 4/qL2 = 2, qL1 = 64/qL2 = 4 and qL1 = 256/qL2 = 8; the source version is not vectorized by gcc.]
Figure 4: Running times for the two-level tiled matrix multiplication (fine grain) with gcc

[Plot omitted: running time (s) over N = 995..1005, full view and close-up, for source, PluTo (def.), qL1 = 4/qL2 = 2, qL1 = 64/qL2 = 4 and qL1 = 256/qL2 = 8; the source version is not vectorized by gcc.]
Figure 5: Running times for the two-level tiled correlation matrix algorithm (fine grain) with gcc

3.2 Ideas towards an automatic derivation of tile sizes

The results presented so far appear to support the hypothesis that a tiling of specific loops can be as effective as a tiling of all loops. Furthermore, they tell us that loop ranges should be SIMD-specific, i.e., fit the SIMD register widths, in order to leverage the full potential of gcc's automatic vectorizer. Up to now, the running times seem to improve with increasing tile sizes. Clearly, the amount of increase that pays off must be limited. Our intuition was that a natural limitation should stem from the cache sizes. Ideally, the block size for the first-level tiling should be fitted to the ratio of the size of the L1 cache and the amount of data read in one iteration of the loop to be vectorized. Similarly, the block size for the second-level tiling should be fitted to the ratio between the L2 cache size and the L1 cache size. In the best case, this strategy leads to the situation that all data required for one tile of the vectorized loop fits into the L1 cache at once and that such blocks of data can be successively pipelined from the L2 cache. With the larger innermost tile size, we should be able to move compulsory cache misses to the prefetcher, and with the cache-specific tiling, we should optimize for fewer capacity cache misses at the same time. Hence, we call the described combination of both concepts a SIMD- and cache-specific (SICA) tiling. For the (optional) second-level tiling, we select the outermost loop of the corresponding nest. This strategy keeps changes to the inner loops as rare as possible which, as a heuristic, should be good for the prefetcher.

We set up a straightforward model to compute all the information needed for experiments to analyze whether this is indeed a promising approach. First of all, we need to know how many new operands must be loaded by each iteration of a loop. This requires an analysis of the loop's statements, since, e.g., operands (or addresses) that are accessed multiple times should be loaded only once per iteration. Furthermore, constant values as well as operands that do not depend on the vectorized loop need to be loaded only once for the entire loop. As already stated in Sect. 2, the innermost loop may, in general, contain several statements, which are considered to compose a statement block. We calculate the total amount of data elements E to load per iteration for each of these blocks.

elements per iteration:  ElPeIt = E    (1)

Further, we need the following hardware parameters:

• CL1: The size of the L1 cache (in KBytes)

• CL2: The size of the L2 cache (in KBytes)

• R: The SIMD register width (in Bits)

Using this information, the number of elements of the given type (e.g. float or double) with size D (in Bytes) that fit into the L1 cache can be calculated as follows:

cache size in elements:  CaSiEl = (CL1 * 1024) / D    (2)

The number of operands of the given data type with size D (in Bytes) that can be packed into the SIMD registers is denoted by:

elements per register:  ElPeRe = R / (8 * D)    (3)

Since there might be other variables that should be cached, one might like to adjust the ratio of the L1 cache size to be used to, e.g., only 90% or 80%. We therefore introduce a corresponding parameter ρ, where ρ = 1.0 relates to 100%.

ratio of cache to use:  ρ    (4)

For each block of statements we may now compute how many iterations shall be blocked so that all operands required by this block ideally fit into the L1 cache at once:

iterations to block:  ItToBl = ρ * CaSiEl / ElPeIt    (5)


Finally, we want to make the first-level tile size qL1 a multiple of the SIMD register width. Hence, we compute the greatest multiple of ElPeRe that fits into the L1 cache as follows:

qL1 = floor(ItToBl / ElPeRe) * ElPeRe    (6)

Summing up all calculations into a single formula yields:

qL1 = floor( (ρ * (CL1 * 1024 / D) * (1 / E)) / (R / (8 * D)) ) * R / (8 * D)    (7)
    = floor( (ρ * CL1 * 8192) / (R * E) ) * R / (8 * D)    (8)

For the second-level tiling, we simply calculate the ratio of the two cache sizes:

qL2 = CL2 / CL1    (9)
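As a concrete rendering of equations (6), (8) and (9), the following sketch computes both tile sizes; the helper names and the integer truncation details are ours and not part of the actual PluTo extension.

/* Sketch of the SICA tile size computation; CL1/CL2 in KByte, R in bits,
 * D = element size in bytes, E = new elements loaded per iteration.      */
static int sica_qL1(double rho, int CL1, int R, int E, int D) {
    int  ElPeRe = R / (8 * D);               /* elements per SIMD register, eq. (3) */
    long CaSiEl = (long)CL1 * 1024 / D;      /* L1 capacity in elements,    eq. (2) */
    long ItToBl = (long)(rho * CaSiEl / E);  /* iterations to block,        eq. (5) */
    return (int)(ItToBl / ElPeRe) * ElPeRe;  /* round down to a multiple,   eq. (6) */
}

static int sica_qL2(int CL1, int CL2) {
    return CL2 / CL1;                        /* ratio of the cache sizes,   eq. (9) */
}

For the test system and the matrix multiplication example of Sect. 3.2.1 (CL1 = 32, CL2 = 256, R = 128, D = 4, E = 2, ρ = 1.0), this yields qL1 = 4096 and qL2 = 8, matching equation (10) and the text.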

3.2.1 Example

Consider the standard matrix multiplication for single precision floating point data with only one statement C[i][j] = C[i][j] + α * A[i][k] * B[k][j] and vectorized j-loop. Two new data elements need to be loaded per j-iteration, namely C[i][j] and B[k][j]. This leads to the following SICA L1 tile size for our test system with 32 KByte of L1 cache and 128 Bit SSE registers:

qL1 = floor( (ρ * 32 * 8192) / (128 * 2) ) * 128 / (8 * 4)  =  4096  (for ρ = 1.0)    (10)

Since our system has 256 KByte of L2 cache, the second block size evaluates to qL2 = 256 / 32 = 8.

3.3 SICA extensions to PluTo

We implemented our SICA tiling as an extension of PluTo, together with several new parameters and functionalities that cause only negligible overhead. It comprises adaptive components, e.g., procedures to determine hardware parameters using the CPUID [15] instructions (they can equally be set manually via a configuration file), and new routines to calculate the amount of data loaded in one loop iteration. If an innermost loop contains multiple statements (possibly as a consequence of the polyhedral transformations), it is viable to group them into blocks for which the tiling is then performed independently. Unlike in PluTo's original tiling algorithm, every block of statements can be associated with individual tile sizes. This is necessary since different statement blocks (within the same loop nest) may require a different number of operands to be loaded per iteration. Nevertheless, one may still request a globally uniform tiling by adopting the minimal or maximal determined tile size for all statement blocks. As an example, a rectangular SICA tiling of a perfectly nested loop with only one statement S is depicted in Fig. 24.
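The extension itself reads these parameters via the CPUID instructions [15] or a configuration file; as a rough sketch of the idea only, cache sizes could also be queried on a Linux/glibc system as below. The sysconf route is our assumption, not what the extension does, and the SIMD register width R would still have to come from CPUID or the configuration file.

#include <stdio.h>
#include <unistd.h>

/* Sketch only: query CL1 and CL2 (in KByte) via glibc's sysconf. */
int main(void) {
    long CL1 = sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024;
    long CL2 = sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024;
    printf("CL1 = %ld KByte, CL2 = %ld KByte\n", CL1, CL2);
    return 0;
}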

4. ANALYSIS AND BENCHMARKS

The following experiments can be divided into two parts. First, we deliver a verification of the proposed correlation between the amount of data read within the vectorized loops' iterations, tile sizes, cache sizes and performance. After that, we evaluate the performance of the corresponding adaptive approach concerning running times, cache miss rates, TLB misses and the rate of issued SIMD instructions.

4.1 Verification of the approach

In order to evaluate the impact of the tile sizes only, we fix some (asymmetric) matrix sizes. We keep the matrix dimensions corresponding to non-vectorized loop dimensions small in order to be able to benchmark a large interval of sizes for the vectorized one and to obtain reasonable running times at the same time.

4.1.1 SICA L1 tiling

We again start our experiments with a one-level tiling. For the matrix multiplication, we set M = 189, N = 139233 and K = 189, since the loop corresponding to the N dimension is the vectorized one. Then, we vary the tile sizes by successively changing the cache-ratio parameter ρ from 0 to 10 in steps of 0.01. This results in 1000 different versions of the code with tile sizes up to 36864. Fig. 6 shows their corresponding running times, which all lie within the shadowed area, while the line is a cubic interpolation of them with some smoothing applied.

[Plot omitted: running time (s) of the matrix multiplication, roughly 2.9–3.1 s, over ρ = 0.25..10 (qL1 = 1024..36864); all measurements lie within a shaded band, the line is a smoothed cubic interpolation, and ρ = 1.0 corresponds to qL1 = 4096.]
Figure 6: Impact of the tile size on the running time of the matrix multiplication code with gcc

There is a global optimum that corresponds to about 90% of the L1 cache size. This appears to confirm a correlation between performance, tile and cache sizes. Furthermore, we claimed a relationship between the optimum tile sizes and the amount of data that needs to be loaded within the vectorized loops' iterations. To verify this, we additionally measured the running times of codes performing the addition of two or even three matrix multiplications (the corresponding statements in the nested loop are depicted in Fig. 20). For each additional matrix multiplication, there is one additional operand to be loaded per iteration. Regarding our test system, this leads to tile sizes of qL1 = 4096 for a single matrix multiplication [matmul1], qL1 = 2728 for the sum of two of them [matmul2] and qL1 = 2048 for the sum of three of them [matmul3]. Again, Fig. 7, in which we scaled the running times of the different versions to a common ordinate (each individual range in seconds is denoted in brackets), shows that the best tile size is at about 80 to 90% of the theoretical optimum.
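These tile sizes follow directly from equation (8); for matmul2 (E = 3), for instance, the arithmetic is:

    qL1 = floor( (1.0 * 32 * 8192) / (128 * 3) ) * 128 / (8 * 4) = 682 * 4 = 2728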

As before, the same experiments are repeated for the correlation matrix algorithm. Here, we set M = 11923 and N = 89, since loops corresponding to the M dimension are vectorized. Since there are several statements and multiple loop nests (see Fig. 21 for the original code and Fig. 22 for the SICA version), there is no unique tile size but an individual one for each statement block.


[Plot omitted: running times over qL1 = 1024..6144, scaled to a common ordinate, for matmul1 (range 2.85–2.95 s), matmul2 (4.45–4.65 s) and matmul3 (6.34–6.54 s); markers indicate the respective ρ = 1.0 tile sizes.]
Figure 7: The effect of more data to be loaded in each iteration of the vectorized loop with gcc

It is therefore reasonable to measure the running times only by varying ρ and thereby scaling the tile sizes proportionally.

In Fig. 8, the best tile size turns out to be almost exactly the theoretical optimum. Other interesting observations are the local optima at ρ = 2.0 and ρ = 3.0, where the necessary data could be loaded from the cache in exactly two or three portions.

[Plot omitted: running time (s) of the correlation matrix algorithm, roughly 4.25–4.27 s, over ρ = 0.25..8; marker at ρ = 1.0.]
Figure 8: Impact of the tile size on the running time of the correlation matrix algorithm with gcc

4.1.2 SICA L2 tiling

For a two-level tiling, we have already given a theoretical justification for choosing the outermost loop of the nest of the vectorizable one. Now, we deliver an experimental justification of this decision using the matrix multiplication example. The resulting running times for each selection of one of the three nested loops for the second-level tiling (while tiling the first level with the calculated qL1) are depicted in Fig. 9. Only when the outermost loop (first) is selected does the running time improve. Furthermore, the calculated theoretical optimum of qL2 = 8 (the L2 cache of the test system is 8 times larger than the L1 cache) leads to the best running times.

4.2 Performance counters

To further investigate the impact of the SICA tiling and to explain the improved running times, we measured PAPI [19] performance counters for the original source code and for the two-level tilings by PluTo (default) and by our extension, with the asymmetric matrix sizes from Sect. 4.1.

[Plot omitted: running time (s), roughly 2.4–3.2 s, over qL2 = 2..10 when the second-level tiling is applied to the first, second or third loop of the matrix multiplication nest.]
Figure 9: Second level tiling: The impact of the choice of a distinct loop and its tile size (with gcc)

In particular, we focused on the effects concerning the L2 cache miss rate (as each L2 cache miss is preceded by an L1 cache miss and the L2 cache, being the last on-core level, is crucial for our intended pipeline), the rate of introduced SIMD instructions and the number of L1/L2 TLB misses. Concerning the cache behavior, it is important to consider miss rates instead of absolute numbers of cache misses, since successful prefetching of the accessed data may increase the total number of cache misses and accesses at the same time. Misses obtained in this way do not significantly harm performance; rather, they ensure cache hits when the operands are really needed. We additionally measured the cache hit rate but do not picture it explicitly. As could be expected, it sums up to 100% together with the cache miss rate (apart from minor deviations of the counters).
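A minimal sketch of how such a miss rate can be obtained with PAPI's classic high-level counter API is shown below; the harness and the event choice are ours, and which events are available depends on the processor.

#include <stdio.h>
#include <papi.h>

void kernel(void);   /* the (tiled) loop nest under test -- assumed to exist */

int main(void) {
    int       events[2] = { PAPI_L2_TCM, PAPI_L2_TCA };  /* L2 misses / accesses */
    long long values[2];

    if (PAPI_start_counters(events, 2) != PAPI_OK) return 1;
    kernel();
    if (PAPI_stop_counters(values, 2) != PAPI_OK) return 1;

    printf("L2 cache miss rate: %.2f%%\n", 100.0 * values[0] / values[1]);
    return 0;
}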

When using gcc, both the SICA tiling and PluTo's default tiling largely improve the performance of the matrix multiplication code (cf. Fig. 10). While the original source code cannot be vectorized by gcc at all, the tiled versions enable the introduction of SIMD instructions. In case of the SICA tiling, nearly all instructions are in fact SIMD instructions. Tiling only two instead of all loop dimensions results in fewer cases where the control flow enters the innermost loop with 'remainders' of iterations that cannot be vectorized by gcc. This effect is intensified by the choice of asymmetric matrix sizes for these experiments. The larger tile size for the vectorized loop (in comparison to PluTo) corresponds to many predictable accesses to the innermost (linearly stored) array dimension. Both tiled versions lead to a reduction of the L2 cache miss rate. This is especially true with the fitted tile sizes calculated by the SICA extension, which also lead to a stronger reduction of TLB misses.

Fig. 11 shows that the internal optimizations of icc perform better when applied to the original source code than when applied to PluTo's default tiled version. This is especially true for the L2 cache miss rate, which is even marginally better for the original source than with the fitted SICA tile sizes. However, with the SICA tiling, icc is able to produce the fastest code since it can again turn nearly every instruction into a SIMD instruction. The situation concerning TLB misses is as before, with the exception that the code produced by icc from the original source leads to far fewer TLB misses than with gcc.



[Bar charts rendered as a table:]
                             source        PluTo (default)   SICA
running time (s)             24.97         3.69              2.52
L2 cache miss rate (%)       35.76         26.41             4.78
vectorized instructions (%)  0.00          71.53             99.69
TLB misses                   1345541255    3111456           69055

Figure 10: PAPI performance counters for the matrix multiplication code with gcc

[Bar charts rendered as a table:]
                             source        PluTo (default)   SICA
running time (s)             1.99          3.03              1.73
L2 cache miss rate (%)       4.02          25.34             5.75
vectorized instructions (%)  83.66         71.28             99.68
TLB misses                   2676671       2316576           65613

Figure 11: PAPI performance counters for the matrix multiplication code with icc

[Bar charts rendered as a table:]
                             source        PluTo (default)   SICA
running time (s)             8.22          5.81              3.41
L2 cache miss rate (%)       19.17         33.87             5.82
vectorized instructions (%)  0.01          99.57             99.67
TLB misses                   1595361       20991             1730977

Figure 12: PAPI performance counters for the correlation matrix code with gcc

[Bar charts rendered as a table:]
                             source        PluTo (default)   SICA
running time (s)             2.81          3.12              2.25
L2 cache miss rate (%)       22.42         38.12             5.85
vectorized instructions (%)  99.90         99.66             99.90
TLB misses                   2482883       20473             1859533

Figure 13: PAPI performance counters for the correlation matrix code with icc


In the case of the correlation matrix algorithm and gcc, the source codes produced by PluTo and with the SICA tiling both improve the running time, as shown in Fig. 12. Even more, both tilings lead to a nearly perfect vectorization of the code. However, whereas the PluTo code leads to an increase of the L2 cache miss rate compared to the original source code, the SICA tiling leads to a decrease, which explains the better running time. In contrast, and with less impact on the running time, PluTo performs by far best concerning TLB misses, while the SICA version cannot even improve on the number of TLB misses produced by the original source code.

As with the matrix multiplication, PluTo's default tiling does not lead to a better running time compared to the original source code when compiled with icc. For the L2 cache miss rate and the TLB misses, the results are similar to the case of using gcc. However, icc is able to fully vectorize the original code, and there is nearly no difference in the rate of vectorization using any of the three codes as input.

4.2.1 Interim summary

For the considered test cases, and compared to the original source code and PluTo's default tiling, the SICA approach always produced the best running times, regardless of whether gcc or icc was used. A big advantage of the SICA tiling is that it typically enables the compilers to vectorize a larger number of instructions. Further, across all benchmarks, it leads to a small L2 cache miss rate, which is the best one obtained except for the matrix multiplication with icc. Concerning TLB misses, possible improvements depend on the access pattern of the statements within the vectorized loop.

4.3 Performance benchmarks

Finally, we would like to verify the performance of the SICA two-level tiling for the two application codes across a larger range of symmetric matrix input sizes. We selected the sizes i · 1024 for i ∈ [2, 8]. Since these are all multiples of four, they diminish the disadvantages of PluTo's default tiling observed in the previous benchmarks, i.e., there are no non-vectorizable 'remainders' of loop iterations anymore. Similarly, gcc is always able to apply its auto-vectorization already to the original source code.

Fig. 14 and 16 show the corresponding running times for gcc, and Fig. 15 and 17 those for icc. The output produced by the SICA tiling appears to help both compilers produce faster code than for the original source and PluTo's default tiling. For completeness, Table 3 depicts the average speedups obtained for the tested input sizes.

gcc            matrix multiplication   correlation matrix
PluTo (def.)   11.14                   4.47
SICA           20.05                   8.89

icc            matrix multiplication   correlation matrix
PluTo (def.)   1.01                    3.73
SICA           1.31                    7.54

Table 3: Average speedups (coarse grain)

5. CONCLUSION, ONGOING WORK AND OUTLOOK

We presented an adaptive hardware-aware tiling approach that has been implemented and evaluated as an extension to PluTo.

[Plot omitted: running time (s) over N = 2048..8192 for source, PluTo (def.) and SICA, with a close-up of chosen measurements.]
Figure 14: Running times for the two-level tiled matrix multiplication (coarse grain) with gcc

[Plot omitted: running time (s) over N = 2048..8192 for source, PluTo (def.) and SICA, with a close-up of chosen measurements.]
Figure 15: Running times for the two-level tiled matrix multiplication (coarse grain) with icc

[Plot omitted: running time (s) over N = 2048..8192 for source, PluTo (def.) and SICA, with a close-up of chosen measurements.]
Figure 16: Running times for the two-level tiled correlation matrix (coarse grain) with gcc

[Plot omitted: running time (s) over N = 2048..8192 for source, PluTo (def.) and SICA, with a close-up of chosen measurements.]
Figure 17: Running times for the two-level tiled correlation matrix (coarse grain) with icc

In contrast to PluTo, it does not perform a tiling of all loops in a nest, but specific tilings of distinct loops that are beneficial for an effective vectorization. It is adaptive in that it performs an automatic analysis of the amount of data accessed by a loop's statements and is able to derive information about the underlying hardware, such as cache sizes and SIMD register widths. These parameters are used to derive dynamic tile sizes that ideally lead to a software pipeline of blocks to be processed by the SIMD units and to be well prefetched through the cache hierarchy.

We verified and evaluated our approach experimentally on two source codes from PluTo's example suite, namely a standard matrix multiplication and a correlation matrix algorithm. As our results show, the proposed tiling strategy leads to better running times in comparison to PluTo's default tiling and the original source code when its output is compiled with gcc 4.6 or icc 13.0. An analysis with performance counters revealed that these results are mainly explained by an increase of issued SIMD instructions and a strong reduction of the L2 cache miss rate.

In further studies (see [10]), we examined the behavior of our extension on additional source codes. They include applications that contain very deep loop nests, where a static all-dimension tiling therefore leads to blocks that are by far too large for today's usual cache sizes. This may considerably harm performance, whereas our extension can handle these cases through its dynamic analysis and the restriction to tile at most two loops.

However, we do not consider a tiling of at most two loops to be a globally superior strategy. On the contrary, if the dependence structure of a statement block refers to multiple loop dimensions, a corresponding specific multi-dimensional tiling could be superior in terms of spatial locality and TLB performance. We plan to consider this in future work. Similarly, we would like to manipulate the loop transformations towards an automatic optimization for one-strided memory accesses. This could potentially be achieved by (a) separating the statements of a block according to their access patterns, (b) transforming them one by one targeting one-strided accesses, and (c) vectorizing the resulting loop nests. We also plan to experiment with L3 cache tiling as well as with the combination of our developments with automatic shared-memory parallelization in order to exploit the full concurrency potential of modern multicore processors.

Currently, our developments are being ported to the PoCC [16] framework (which includes PluTo) in order to integrate our extensions into Polly and thereby into the LLVM infrastructure [14].

6. ACKNOWLEDGMENTS

Thanks to Uday Bondhugula for the development of the PluTo framework.

7. REFERENCES
[1] J. Abella, A. Gonzalez, J. Llosa, and X. Vera. Near-optimal loop tiling by means of cache miss equations and genetic algorithms. In Proc. Int. Parallel Processing Workshop, pages 568–577, 2002.
[2] U. K. Banerjee. Loop Parallelization. Loop transformations for restructuring compilers. Kluwer Academic Publishers, Norwell, MA, USA, 1994.
[3] D. Bernstein, D. Cohen, and A. Freund. Compiler techniques for data prefetching on the PowerPC. In Proc. IFIP WG10.3 Working Conf. on Parallel Architectures and Compilation Techniques, PACT '95, pages 19–26, Manchester, UK, 1995. IFIP Working Group on Algol.
[4] U. Bondhugula. PluTo: An automatic parallelizer and locality optimizer for multicores.
[5] U. Bondhugula. Effective Automatic Parallelization and Locality Optimization Using the Polyhedral Model. PhD thesis, Columbus, OH, USA, 2008.
[6] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proc. 2008 ACM SIGPLAN Conf. on Programming Language Design and Implementation, PLDI '08, pages 101–113, New York, NY, USA, 2008. ACM.
[7] A. Cohen, M. Sigler, D. Parello, S. Girbal, O. Temam, and N. Vasilache. Facilitating the search for compositions of program transformations. In ACM Int. Conf. on Supercomputing (ICS'05), pages 151–160, 2005.
[8] S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, PLDI '95, pages 279–290, New York, NY, USA, 1995. ACM.
[9] P. Feautrier. Some efficient solutions to the affine scheduling problem: I. One-dimensional time. Int. J. Parallel Program., 21:313–348, 1992.
[10] D. Feld. Effiziente Vektorisierung durch semi-automatisierte Code-Optimierung im Polyedermodell. Available at http://publica.fraunhofer.de/documents/N-194546.html, 2011.
[11] S. Ghosh, M. Martonosi, and S. Malik. Automated cache optimizations using CME driven diagnosis. In Proc. 14th Int. Conf. on Supercomputing, ICS '00, pages 316–326, New York, NY, USA, 2000. ACM.
[12] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. Int. J. Parallel Program., 34:261–317, 2006.
[13] M. Griebl. Automatic parallelization of loop programs for distributed memory architectures, 2004.
[14] T. Grosser, H. Zheng, R. A, A. Simbürger, A. Größlinger, and L.-N. Pouchet. Polly - polyhedral optimization in LLVM. In First Int. Workshop on Polyhedral Compilation Techniques (IMPACT'11), Chamonix, France, April 2011.
[15] Intel. Intel processor identification and the CPUID instruction. Technical report, Intel Corporation, 2011.
[16] L.-N. Pouchet. PoCC: the Polyhedral Compiler Collection.
[17] V. Sarkar and N. Megiddo. An analytical model for loop tiling and its solution. In Proc. IEEE Int. Symp. on Performance Analysis of Systems and Software, ISPASS '00, pages 146–153, Washington, DC, USA, 2000. IEEE Computer Society.
[18] J. Shirako, K. Sharma, N. Fauzia, L.-N. Pouchet, J. Ramanujam, P. Sadayappan, and V. Sarkar. Analytical bounds for optimal tile size selection. In Proc. 21st Int. Conf. on Compiler Construction, CC'12, pages 101–121, Berlin, Heidelberg, 2012. Springer.
[19] D. Terpstra, H. Jagode, H. You, and J. Dongarra. Collecting performance data with PAPI-C. 2009.
[20] K. Trifunovic, D. Nuzman, A. Cohen, A. Zaks, and I. Rosen. Polyhedral-model guided loop-nest auto-vectorization. In Proc. 18th Int. Conf. on Parallel Architectures and Compilation Techniques, PACT '09, pages 327–337, Washington, DC, USA, 2009. IEEE Computer Society.
[21] J. Xue. Loop Tiling for Parallelism. Kluwer Int. Series in Engineering and Computer Science. Kluwer Academic, 2000.


APPENDIX
A. STATIC CONTROL PARTS (SCoPs) OF THE CONSIDERED EXAMPLE CODES

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < K; k++)
      C[i][j] = C[i][j] + alpha * A[i][k] * B[k][j];

Figure 18: Original standard matrix multiplication

for (t1 = 0; t1 <= floor(M-1, 8); t1++) {
  for (t2 = 0; t2 <= floor(N-1, 3684); t2++) {
    for (t3 = 0; t3 <= K-1; t3++) {
      for (t4 = 8*t1; t4 <= min(M-1, 8*t1+7); t4++) {
        lbv = 3684*t2; ubv = min(N-1, 3684*t2+3683);
        #pragma ivdep
        #pragma vector always
        for (t9 = lbv; t9 <= ubv; t9++) {
          C[t4][t9] = C[t4][t9] + alpha*A[t4][t3]*B[t3][t9];
        }
      }
    }
  }
}

Figure 19: Matrix multiplication (SICA applied with ρ = 0.9)

C[i][j] = C[i][j] + A[i][k]*B[k][j];
C[i][j] = C[i][j] + A[i][k]*B[k][j] + D[i][k]*E[k][j];
C[i][j] = C[i][j] + A[i][k]*B[k][j] + D[i][k]*E[k][j] + F[i][k]*G[k][j];

Figure 20: The different statements in matmul1, matmul2 and matmul3

/* Center and reduce the column vectors. */
for (i = 1; i <= N; i++)
  for (j = 1; j <= M; j++) {
    data2[i][j] -= mean[j];
    data2[i][j] /= sqrt(N) * stddev[j];
  }

/* Calculate the M * M correlation matrix. */
for (j1 = 1; j1 <= M-1; j1++) {
  symmat[j1][j1] = 1.0;
  for (j2 = j1+1; j2 <= M; j2++) {
    symmat[j1][j2] = 0.0;
    for (i = 1; i <= N; i++)
      symmat[j1][j2] += (data2[i][j1] * data2[i][j2]);
    symmat[j2][j1] = symmat[j1][j2];
  }
}

Figure 21: Original correlation matrix SCoP

for (t2 = 0; t2 <= floor(M-1, 8); t2++) {
  for (t3 = ceil(t2-920, 921); t3 <= floor(M, 7368); t3++) {
    for (t5 = max(1, 8*t2); t5 <= min(min(M-1, 8*t2+7), 7368*t3+7366); t5++) {
      lbv = max(7368*t3, t5+1); ubv = min(M, 7368*t3+7367);
      #pragma ivdep
      #pragma vector always
      for (t10 = lbv; t10 <= ubv; t10++) {
        symmat[t5][t10] = 0.0;
      }
    }
  }
}
for (t2 = 1; t2 <= M-1; t2++) {
  symmat[t2][t2] = 1.0;
}
for (t2 = 0; t2 <= floor(N, 8); t2++) {
  for (t3 = 0; t3 <= floor(M, 2456); t3++) {
    for (t5 = max(1, 8*t2); t5 <= min(N, 8*t2+7); t5++) {
      lbv = max(1, 2456*t3); ubv = min(M, 2456*t3+2455);
      #pragma ivdep
      #pragma vector always
      for (t10 = lbv; t10 <= ubv; t10++) {
        data[t5][t10] -= mean[t10];
        data[t5][t10] /= sqrt(N) * stddev[t10];
      }
    }
  }
}
for (t2 = 0; t2 <= floor(M-1, 8); t2++) {
  for (t3 = ceil(2*t2-920, 921); t3 <= floor(M, 3684); t3++) {
    for (t4 = 1; t4 <= N; t4++) {
      for (t5 = max(1, 8*t2); t5 <= min(min(M-1, 8*t2+7), 3684*t3+3682); t5++) {
        lbv = max(3684*t3, t5+1); ubv = min(M, 3684*t3+3683);
        #pragma ivdep
        #pragma vector always
        for (t10 = lbv; t10 <= ubv; t10++) {
          symmat[t5][t10] += (data[t4][t5] * data[t4][t10]);
        }
      }
    }
  }
}
for (t2 = 0; t2 <= floor(M-1, 8); t2++) {
  for (t3 = ceil(2*t2-920, 921); t3 <= floor(M, 3684); t3++) {
    for (t5 = max(1, 8*t2); t5 <= min(min(M-1, 8*t2+7), 3684*t3+3682); t5++) {
      lbv = max(3684*t3, t5+1); ubv = min(M, 3684*t3+3683);
      #pragma ivdep
      #pragma vector always
      for (t10 = lbv; t10 <= ubv; t10++) {
        symmat[t10][t5] = symmat[t5][t10];
      }
    }
  }
}

Figure 22: Correlation matrix SCoP (SICA applied with ρ = 0.9)

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    S(i, j);

for (ii = 0; ii <= floor(M-1, 32); ii++)
  for (jj = 0; jj <= floor(N-1, 32); jj++)
    for (i = 32*ii; i <= min(M-1, 32*ii+31); i++)
      for (j = 32*jj; j <= min(N-1, 32*jj+31); j++)
        S(i, j);

for (iii = 0; iii <= floor(M-1, 256); iii++)
  for (jjj = 0; jjj <= floor(N-1, 256); jjj++)
    for (ii = 8*iii; ii <= min(floor(M-1, 32), 8*iii+7); ii++)
      for (jj = 8*jjj; jj <= min(floor(N-1, 32), 8*jjj+7); jj++)
        for (i = 32*ii; i <= min(M-1, 32*ii+31); i++)
          for (j = 32*jj; j <= min(N-1, 32*jj+31); j++)
            S(i, j);

Figure 23: Perfectly nested loop (left) with one level (middle) and two level (right) traditional tiling by PluTo

for (i = 0; i < M; i++)
  for (jj = 0; jj <= floor(N-1, qL1); jj++)
    for (j = qL1*jj; j <= min(N-1, qL1*jj + (qL1-1)); j++)
      S(i, j);

for (ii = 0; ii <= floor(M-1, qL2); ii++)
  for (jj = 0; jj <= floor(N-1, qL1); jj++)
    for (i = qL2*ii; i <= min(M-1, qL2*ii + (qL2-1)); i++)
      for (j = qL1*jj; j <= min(N-1, qL1*jj + (qL1-1)); j++)
        S(i, j);

Figure 24: Loop from Figure 23 with one level (left) and two level (right) SICA tiling for vectorizable j loop


SPolly: Speculative Optimizations in the Polyhedral Model

Johannes Doerfert
Saarbrücken Graduate School of Computer Science
Saarland University, Saarbrücken, Germany

Clemens Hammacher
Saarbrücken Graduate School of Computer Science
Saarland University, Saarbrücken, Germany

Kevin Streit
Saarbrücken Graduate School of Computer Science
Saarland University, Saarbrücken, Germany

Sebastian Hack
Computer Science Department
Saarland University, Saarbrücken, Germany

ABSTRACT

The polyhedral model is only applicable to code regions that form static control parts (SCoPs) or slight extensions thereof. To apply polyhedral techniques to a piece of code, the compiler usually checks, by static analysis, whether all SCoP conditions are fulfilled. However, in many codes, the compiler fails to verify that this is the case. In this paper we investigate the rejection causes as reported by Polly, the polyhedral optimizer of a state-of-the-art compiler. We show that many rejections follow from the conservative overapproximation of the employed static analyses. In SPolly, a speculative extension of Polly, we employ the knowledge of runtime features to supersede this overapproximation. All speculatively generated variants form valid SCoPs and are optimizable by the facilities of Polly. Our evaluation shows that SPolly is able to effectively widen the applicability of polyhedral optimization. On the SPEC 2000 suite, the number of optimizable code regions is increased by 131 percent. In 10 out of the 31 benchmarks of the PolyBench suite, SPolly achieves speedups of up to 11-fold as compared to plain Polly.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—Code generation, Compilers, Optimization, Retargetable compilers, Run-time environments

General Terms
Performance

Keywords
Adaptive optimization, polyhedral loop optimization, just-in-time compilation, speculative execution

1. INTRODUCTION

The polyhedral model uses polyhedra to abstract essential information of a static control program (SCoP, a restricted form of nested DO loops). By abstracting loop nests as polyhedra, many popular loop optimizations can be elegantly formulated in terms of linear transformations. However, a piece of code has to meet the SCoP criteria [5] (or slight extensions thereof) to be tractable by polyhedral techniques. Those conditions are usually met by many numerical kernels from linear algebra, image processing, signal processing, etc. However, in more general benchmarks such as the SPEC 2000 benchmark suite, the SCoP conditions are often violated. One simple reason is, for example, that the compiler could not prove that all arrays used in the loop nest do not overlap or alias. Further reasons we encountered are: the implementation of functions called in the loop nest was not available because they are extern to the translation unit in question, although at runtime the functions turned out to be pure; also, index expressions and/or loop bounds could not be classified as affine. Often, some parameter which does not change in the loop nest makes an expression non-affine. Hence, specializing the loop nest to the concrete value at runtime makes polyhedral techniques applicable.

In this paper, we want to advance the applicability of polyhedral optimizations by incorporating runtime information. For example, parameters that lead to non-affine expressions can be replaced by their concrete value. This way, using a just-in-time compiler, the program can prepare specially optimized versions for different parameter sets. In summary, this paper makes the following contributions:

• We investigate the applicability of Polly [8], a polyhedral framework for LLVM [10], on the SPEC 2000 benchmark suite and classify the causes that prevent the application of Polly to a candidate code region¹. We conclude that up to 2.4 times more code regions in SPEC 2000 could be considered by Polly if runtime information is available.

• Based on those insights, we present SPolly, a prototypical extension of Polly to integrate runtime information. At the moment, SPolly handles two cases: First, it adds code to the program that checks whether all referenced arrays of a loop nest do not overlap. Second, it uses profiling to record the values of parameters that lead to non-affine index expressions and/or loop bounds, and generates specialized variants that can be selected during runtime. We demonstrate the benefit of these techniques on the PolyBench benchmark suite.

¹As code regions we consider all non-trivial regions according to Grosser et al. [8], i.e., single entry, single exit regions containing at least one loop.

The rest of this paper is organized as follows: In Section 2 we present related work. Section 3 provides an evaluation of the most frequent rejection causes of polyhedral transformations; Section 4 explains how SPolly alleviates some of those causes. Finally, Section 5 evaluates SPolly and Section 6 concludes.

2. RELATED WORK

The polyhedral literature is certainly too extensive to be discussed in full detail here. Furthermore, this paper is concerned with increasing the applicability of the polyhedral model by speculation and runtime analysis and not with inventing new polyhedral techniques. Therefore, we concentrate on the recent work which is directly related to ours.

In his thesis [7], Grosser describes a speedup of up to 8x for a matrix multiplication benchmark, achieved by his polyhedral optimizer Polly [8]. He also produced similar results for other benchmarks of the PolyBench [11] benchmark suite. Other publications [4, 1, 12] show similar results on PolyBench. PolyBench is certainly well suited to compare polyhedral techniques. However, it does not allow for assessing the breadth of their applicability as we do in this paper.

Baghdadi et al. [1] reveal a huge potential for speculative loop optimizations as an extension to the formerly described techniques. They state that aggressive loop nest optimizations (including parallel execution) are profitable and possible, even though overestimated data and flow dependences would statically prevent them. Their manually crafted tests also show the impact of different kinds of conflict management. Overhead, and therefore speedup, differs from loop to loop, as does the applicability of such conflict management systems, but a trend was observable. The best conflict management system has to be determined per loop and per input, but all of them can provide speedups, even those not particularly well suited to the situation.

Bastoul et al. [2, 3] and Girbal et al. [6] also perform an evaluation of the applicability of the polyhedral model on several benchmarks. In contrast to this work, however, they are more concerned with the structure and size of the SCoPs and not with the reasons why other code regions could not be classified as SCoPs.

Jimborean et al. [9] employ the polyhedral model in the context of speculative parallelization. A static parallelization pattern is generated assuming linearity of loop bounds and memory accesses. Additionally, polyhedral transformations are proposed based on static dependence analysis. At runtime, the linearity of memory accesses and the applicability of the proposed candidates is assessed by collecting profiling information. In case of success, the statically generated pattern is instantiated with the gathered information and corresponding code is generated. As the performed transformations are speculative and not validated before execution of the loop, a speculation mechanism guards the execution. In case of detected conflicts, rollbacks are performed and execution switches to the safe, sequential version. The approach has some drawbacks: First of all, it relies on programmer annotations to identify parallelization candidates. Second, polyhedral optimization is limited to transformations that do not change the loop structure or reorder statements. The use of synthetic benchmarks in their evaluation does not allow judging to what extent the approach can profitably extend the applicability of polyhedral optimization on widely used benchmarks such as the PolyBench suite.

3. MOTIVATION

In order to analyze which of the SCoP restrictions limit the applicability of the polyhedral model most, we conducted an evaluation on nine programs from the SPEC 2000 benchmark suite². To this end we instrumented Polly to output all rejection causes that prevent a candidate code region from being considered a polyhedron³. In case of multiple rejection causes for one region, we record all of them.

²As SPolly's runtime environment is based on the Sambamba framework [13], we selected the programs that were contained in the Sambamba test suite: ammp, art, bzip2, crafty, equake, gzip, mcf, mesa, and twolf.
³All programs were compiled with -sink -indvars -mem2reg -polly-prepare.

Figure 1 shows the results of this evaluation. For each rejection cause in Polly, we report three numbers:

• The number of regions where the violation of condition i is a cause for not considering the region (Column A).

• The number of regions where the violation of condition i is the only rejection cause (Column B).

• The number of regions gained when ignoring all conditions 0 to i (Column C).

Over the nine tested programs, 1862 candidate regions are examined, out of which 1587 are rejected by Polly. The remaining 275 regions (14.8%) are considered legal SCoPs.

i   Rejection cause          A      B     C
0   Non-affine expressions   1230   84    84
1   Aliasing                 1093   207   510
2   Non-affine loop bounds   840    6     660
3   Function call            532    72    928
4   Non-canonical indvars    384    0     1174
5   Complex CFG              253    31    1387
6   Unsigned comparison      199    0     1586
7   Others                   1      0     1587

Figure 1: SCoP rejection causes on the SPEC 2000 benchmark suite

The rejection causes are ordered by the number of regions that they affect (column A). In the following, we will examine the conditions constituting the rejection causes in further detail and discuss if and how SPolly can alleviate their impact.

Non-affine expressions As explained in more detail in Section 4.2, all expressions used for memory accesses or predicates of conditional branches have to be affine expressions with respect to parameters and induction variables. If this is not the case, the code cannot be represented as a polyhedron, which prevents the corresponding optimizations. This happens, for example, when a programmer chooses to represent a 2-dimensional array by a 1-dimensional, flattened one, translating the index (i, j) to i ∗ N + j. In case i and j are iteration variables and N is a parameter, this expression is not affine but quadratic. If, however, N can be detected as quasi-constant via profiling, it can speculatively be replaced by that constant to circumvent the non-affinity.

Aliasing Possible aliasing causes a region to be rejected whenever the base addresses of two memory accesses cannot be proven to be disjoint. In particular, this is the case when pointers originate in parameter values instead of global variables or stack-allocated memory, since the default alias analysis used by Polly only works intra-procedurally. In most cases, however, two arrays passed as parameters are disjoint; speculatively assuming non-aliasing may thus be profitable.

Non-affine loop bounds This constraint requires all loop bounds to be affine expressions with respect to the parameters and surrounding iteration variables. In our experiments we frequently observed that although bounds have not been affine at compile time, they often turn out to remain constant, and thus affine, during all executions. This is for example the case if a loop iterates N^2 times, where N is a parameter. Obviously, in these cases a specialized variant can be generated where the loop bounds are considered constant. This not only makes the loop representable as a polyhedron, but also enables better optimizations in the isl.

Function call Another major reason for rejecting a code region is function calls contained in it. In general, computing memory dependences through calls is a hard task; for external functions or indirect calls it is often impossible. Additionally, external functions may not terminate or may have observable effects like text output or aborting the application. Polly rejects each region containing at least one call to a function that cannot be proven to be pure. In practice though, parallelization-prohibiting side-effects—for example exceptions, or error reporting—might manifest only infrequently. In such a case, speculatively ignoring the calls and specializing a region accordingly can make it amenable to polyhedral optimizations. To preserve the original semantics, the specialized code needs protection by a runtime speculation mechanism, e.g. in the form of transactional memory.

Non-canonical induction variables This constraint requires the induction variables to be in a canonical form, starting at zero and being incremented by one in each iteration. If LLVM's and Polly's preparation passes are unable to canonicalize all such variables, the code region is discarded.

Complex CFG This error is reported if complex terminators such as switch instructions are found, or if the control flow has a "complex form" not representable via while and if constructs only.

Unsigned comparison During polyhedral optimizations, Polly may have to modify comparison operators or their operands, for example to alter the iteration space of a loop. Special care has to be taken to handle possible overflows correctly. To this end, this is only implemented for signed operations. Consequently, Polly rejects all code regions containing unsigned comparisons.

Others This is a collection of minor technical limitations, for example rarely used LLVM instructions, which are not handled properly yet. The only occurrence in our tests is a cast from an integer to a pointer in the mcf program.

SPolly concentrates on the first three rejection causes. In our benchmarks, they account for 42% of all SCoP rejections. Thus, assuming we could speculatively eliminate them in all cases, we could expect 660 new SCoPs to be detected, which is exactly 2.4 times the original number of SCoPs. The fourth cause, function calls, could be solved using runtime techniques as described, but in our experiments the introduced overhead of transactional execution did not pay off. Therefore we skipped that for now and consider it future work to improve the performance of the transactional memory system. All other reported causes are either conceptual obstacles (induction variables, complex CFGs) that SPolly cannot remove or technicalities that will disappear as LLVM and Polly mature further.

4. SPOLLY

SPolly extends the applicability of the loop nest optimizer and parallelizer Polly by deploying runtime information and speculatively specializing functions. It targets restrictions on loop nests arising as a consequence of overestimations in static analyses. By inferring common values for parameters, and providing additional conditions for memory accesses, it makes polyhedral optimizations applicable to more code locations, and allows for better optimization and code generation. Those specialized versions will coexist with the original sequential version and will be executed whenever the actual runtime values permit this. This section describes in detail how SPolly achieves this.

4.1 Possible Aliasing

Possible aliasing is the main rejection cause of Polly on the considered benchmarks of the SPEC 2000 suite and is particularly well suited for speculation. An example of a loop with possibly aliasing accesses is given in Figure 2a. It shows two arrays accessed via pointers A and B. In case they point to addresses less than N * sizeof(int) bytes apart, parallel execution and other loop transformations possibly alter the semantics. Most alias analyses are conservative and assume aliasing if A and B could potentially point to the same allocated memory block. Using the latter definition, Polly would not optimize the presented loop without further information on A and B.

Polly offers the possibility to override conservative assumptions concerning possible aliasing. This can be done either by ignoring all aliasing or by annotating individual code regions. Both approaches require manual intervention and profound knowledge of the application to optimize. In contrast, SPolly is able to deal with possible aliases without any programmer-provided information. It does so by introducing alias checks preceding the subject loop nests.

void a(int *A, int *B) {
  int i;
  for (i = 1; i < N; i++)
    A[i] = 3 * B[i];
}

(a) Loop with possible aliasing pointers

void b(int *A, int N) {
  int i, j;
  for (i = 1; i < N; i++)
    for (j = 0; j < N; j++)
      A[j*N+i] += A[j*N+i-1];
}

(b) Loop nest using non-affine expressions

Figure 2: Example loops rejected by Polly for different speculatively resolvable reasons

These checks ensure the absence of conflicts between accesses to loop-invariant base pointers. For the loop shown, the introduced checks are conclusive since the accessed range, relative to the base address, is known before entering the loop; non-aliasing of all accesses to the arrays can thus be checked a priori. This approach allows optimizing loops even if the used base pointers might point into the same allocated block, for example different parts of the same array. If the checks fail, the original, unmodified version is executed; otherwise the optimized version is chosen. An example that would not benefit from this approach, because of possibly aliasing loop-variant pointers, is dynamic multidimensional arrays (arrays of pointers to arrays). Iterating over those will change the base pointer of the inner dimension for each execution of the outer loop, and checking that none of them alias is too expensive to be performed prior to execution of the loop.
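A sketch of the kind of check this amounts to for the loop of Figure 2a is shown below; the dispatcher and the variant names are ours, not SPolly's generated code, and N is passed explicitly for the sake of a self-contained example.

#include <stdint.h>

void a_original(int *A, int *B, int N);   /* unmodified sequential version (assumed)  */
void a_optimized(int *A, int *B, int N);  /* polyhedrally optimized version (assumed) */

void a_dispatch(int *A, int *B, int N) {
    /* The loop accesses A[1..N-1] and B[1..N-1]; the accesses are conflict
     * free iff the two ranges are at least N * sizeof(int) bytes apart.   */
    uintptr_t a = (uintptr_t)A, b = (uintptr_t)B;
    uintptr_t bytes = (uintptr_t)N * sizeof(int);
    if (a + bytes <= b || b + bytes <= a)
        a_optimized(A, B, N);
    else
        a_original(A, B, N);
}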

4.2 Non-affine Parameters

Consider Figure 2b. C does not provide support for n-dimensional arrays whose dimensions are not known statically. Hence, a common pattern is to implement n-dimensional arrays using a 1D array and performing the index arithmetic "manually". This pattern creates non-affine array subscripts on the 1D array. Using static analysis, one could infer that the 1D access is actually a 2D access, at least in this example. However, such an analysis is currently not implemented in Polly. Additionally, it is imaginable that the code is more complicated, making the static analysis unsuccessful.

Thus, SPolly utilizes runtime information gained with a profiling version of the loop nest to identify recurring parameter values. For those values, specialized loop versions are created with constant values plugged in for the problematic parameters. Dispatching code is inserted before the corresponding loop to decide at runtime whether a specialized version exists for the actual parameter values. If none is found, the original, less optimized version is used. In our example, specialization would make the multiplications affine and therefore representable in the polyhedral model, and thus amenable to all polyhedral optimizations implemented in Polly.
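A minimal sketch of this dispatching scheme for Figure 2b, assuming profiling revealed N == 1024 as a frequently recurring value (the function names and the constant are hypothetical, chosen only for illustration):

void b_spec_1024(int *A);       /* clone of b() with N replaced by the constant 1024;
                                   all subscripts become affine, so the clone can be
                                   optimized by the polyhedral toolchain */
void b_original(int *A, int N); /* unmodified, less optimized fallback */

void b_dispatch(int *A, int N) {
  if (N == 1024)
    b_spec_1024(A);     /* a specialized version matches the actual parameter value */
  else
    b_original(A, N);   /* any other value: execute the original code */
}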

Finally, we observed (see the next section) that replacing parameters with constants often leads to superior code because it enables more aggressive optimizations. Thus, even if a static analysis were able to recognize pseudo-1D array accesses and restate them as multidimensional accesses, executing specialized versions would often still lead to better performance.

5. EVALUATION

Figure 3: Quantitative analysis of the applicability of SPolly compared to Polly. Overall, SPolly provides 635 sSCoPs while Polly finds 275 SCoPs.

In order to evaluate the success of incorporating runtime information to speculatively optimize code regions, we investigate two hypotheses:

1. SPolly is able to handle substantially more regions than Polly. In other words, the number of speculative SCoPs (sSCoPs) is substantially larger than the number of SCoPs.

2. The extended applicability of the polyhedral model results in improved runtime performance.

Hypothesis 1 is motivated by Section 3, where we showed that the constraints SPolly tackles are the main reasons for rejecting a SCoP. Hypothesis 2 follows from the primary goal of polyhedral optimizations, which is reducing program runtime.

5.1 Extended Applicability

In order to evaluate the successful elimination of SCoP rejection causes, we executed SPolly on the SPEC 2000 tests that we already used in Section 3. We then compared the number of sSCoPs detected by SPolly against the number of SCoPs detected by Polly. A graphical representation of this comparison is shown in Figure 3.

For all test cases except “mcf”, our speculative extensions provide an increased number of code regions amenable to polyhedral optimizations. The overall number of sSCoPs is 635, compared to the 275 SCoPs detected by Polly. This is an increase of 131 percent (360 additional SCoPs).

Compared to the expectations in Section 3, where we identified an upper bound of 660 additional SCoPs, we reached a success rate of 55 percent. The remaining sSCoP candidates contain code where our heuristics decided that speculation is not profitable, or where speculation is impossible. This is the case if,


Figure 4: Speedups in execution time of SPolly-enabled programs on the PolyBench suite, normalized to the standard Clang-compiled version with all optimizations enabled

for example, an external function call is executed unconditionally, or indirect pointer loads are detected where aliasing cannot be checked a priori.

Overall, we conclude that SPolly indeed widens the applicability of polyhedral optimizations substantially.

5.2 Performance

After confirming that the number of SCoPs detected in SPEC 2000 is indeed increased, we investigated to which extent this actually improves the runtime of the programs. We observed that even though the additional SCoPs were valid and some of them were executed, the most interesting code regions were not optimized due to additional obstacles we could not eliminate, mainly external function calls and indirect memory accesses in the hot loops. On these tests, we are therefore unable to achieve significant speedups.

Thus we resort to the PolyBench suite [11], which has also been used before by Grosser et al. [8, 7] to evaluate the performance of Polly. We use the current version 3.2 and the provided “large” data sets.

In order to compare the speedup in program execution achievable by applying polyhedral optimizations to statically validated SCoPs alone against that of speculatively created and optimized polyhedra, we run all tests:

• on the standard toolchain of Clang and LLVM with all optimizations enabled,
• using Polly after applying the standard preparation passes (see Section 3),
• using SPolly without any prior knowledge about the program, and
• using SPolly a second time, such that profiling information from the previous run is already available.

All tests are conducted on an eight-core Intel Xeon machine running at 2.93 GHz with 24 GB of main memory.

Surprisingly, we are unable to reproduce earlier results of Polly on the PolyBench suite. Investigating why Polly rejects the SCoPs for the compute-intensive kernels of all of the benchmarks revealed that in all cases this is due to aliasing problems. Since PolyBench 3.0, released in October 2011, the tests use heap-allocated arrays, passed to the kernel via pointer arguments plus array sizes. Polly is unable to prove non-aliasing of those pointers and thus rejects all possibly affected SCoP candidates. This is why we did not find any program for which Polly was able to improve the performance over a standard Clang-compiled program.

Figure 4 shows the speedups of the SPolly-enabled program runs, normalized to the runtime of the corresponding Clang-compiled program. Missing values in this bar chart indicate failures during optimization, which mainly originated from the CLooG code generator, but also from polytope construction in Polly. Nevertheless, for most of the programs we are able to report runtime results. It is not surprising that there are many programs where SPolly is not able to bring the kernel into a form amenable to polyhedral optimizations; for these programs no speedup is achieved.

However, there are also programs where SPolly indeed provides enormous speedups. For the first run, SPolly cannot make use of any profiling data, so only those speculative transformations can be applied for which conclusive runtime checks can be synthesized. For these runs, the highest achieved speedup is 5.1-fold, on the mvt program.


In the second run, however, specialized versions can be created based on the profiles gathered in previous runs. This allows loop bounds and parametric values to be replaced by constant expressions, which not only makes those code regions representable as polyhedra, but also helps other transformations and the CLooG backend to create better code, e.g. by choosing the best tile size for loop tiling. Thus, the speedups achieved in the second runs are significantly higher in almost all cases, up to a maximum of 11-fold for the “gemm” program. In two cases (mvt and gemver), the speedup in the second run is slightly lower than in the first run. This is not due to the overhead we introduce for checking whether a specialized code version can be chosen (this overhead was negligible in all our tests), but because the specialized variant actually executed slower than the original one. Anticipating whether a code transformation, especially parallelization, will pay off at runtime is a well-known problem for which no general solution exists, and it is beyond the scope of this paper.

6. CONCLUSION

In this paper, we analyzed the most prominent causes that prohibit polyhedral optimizations for individual code regions. We observed that promising regions are often rejected by the polyhedral framework Polly because of over-approximation in the static analyses and because of missing parameter values. Driven by this observation, we devised runtime checks to dynamically switch to specialized variants of a code region in which the obstructive feature is removed. This approach is implemented in SPolly, a speculative extension to Polly. It enables the application of polyhedral optimizations to many more code locations, providing better runtime performance in those cases where the specialized variant could be used.

7. REFERENCES

[1] R. Baghdadi, A. Cohen, C. Bastoul, L.-N. Pouchet, and L. Rauchwerger. The potential of synergistic static, dynamic and speculative loop nest optimizations for automatic parallelization. In Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA’10), June 2010.
[2] C. Bastoul. Improving Data Locality in Static Control Programs. PhD thesis, University Paris 6, Pierre et Marie Curie, France, 2004.
[3] C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam. Putting polyhedral loop transformations to work. In Workshop on Languages and Compilers for Parallel Computing (LCPC’03), LNCS, pages 23–30. Springer-Verlag, Oct. 2003.
[4] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’08), pages 101–113, 2008.
[5] P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23–53, 1991.
[6] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming, 34(3):261–317, 2006.
[7] T. Grosser. Enabling Polyhedral Optimizations in LLVM. Diploma thesis, University of Passau, Apr. 2011.
[8] T. Grosser, H. Zheng, R. Aloor, A. Simbürger, A. Größlinger, and L.-N. Pouchet. Polly - Polyhedral Optimization in LLVM. In First International Workshop on Polyhedral Compilation Techniques (IMPACT’11), Apr. 2011.
[9] A. Jimborean, L. Mastrangelo, V. Loechner, and P. Clauss. VMAD: an Advanced Dynamic Program Analysis & Instrumentation Framework. In 21st International Conference on Compiler Construction (CC), pages 220–237, Mar. 2012.
[10] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04), Mar. 2004.
[11] L.-N. Pouchet. PolyBench, the Polyhedral Benchmark suite. http://www.cse.ohio-state.edu/~pouchet/software/polybench/, 2010.
[12] B. Pradelle, A. Ketterlin, and P. Clauss. Polyhedral parallelization of binary code. ACM Transactions on Architecture and Code Optimization, 8(4):39:1–39:21, Jan. 2012.
[13] K. Streit, C. Hammacher, A. Zeller, and S. Hack. Sambamba: A runtime system for online adaptive parallelization. In Proc. 21st International Conference on Compiler Construction (CC), pages 240–243, Mar. 2012.


APPENDIX

Benchmark        | clang  | clang -O2 | clang -O3 | SPolly 1st run | SPolly 2nd run | Speedup 1st run | Speedup 2nd run
bicg             |   0.66 |      0.22 |      0.22 |           0.32 |           0.23 |            0.72 |            0.99
syrk             |  49.03 |     12.16 |     12.16 |           5.48 |           5.50 |            2.21 |            2.21
jacobi-1d-imper  |   0.95 |      0.23 |      0.23 |           0.25 |           0.25 |            0.93 |            0.93
trmm             |  24.87 |      6.44 |      6.44 |           8.08 |           8.09 |            0.80 |            0.80
symm             | 129.49 |    119.75 |    119.75 |                |                |                 |
syr2k            |  86.57 |     25.76 |     25.76 |          10.03 |           7.56 |            2.57 |            3.40
cholesky         |   6.71 |      1.68 |      1.68 |                |                |                 |
fdtd-apml        |  12.42 |     10.69 |     10.69 |          10.84 |          10.86 |            0.98 |            0.98
3mm              | 294.77 |    233.01 |    233.01 |          72.96 |          29.30 |            3.19 |            7.95
lu               |  19.18 |      4.53 |      4.53 |           4.58 |           4.59 |            0.99 |            0.99
doitgen          |  45.07 |     10.95 |     10.95 |          14.95 |           2.00 |            0.73 |            5.46
seidel-2d        |   1.30 |      1.12 |      1.12 |           1.12 |           1.12 |            1.00 |            1.00
gemm             | 107.93 |     77.48 |     77.48 |          24.49 |           7.05 |            3.16 |           10.98
adi              |  12.56 |      9.31 |      9.31 |          16.10 |           7.88 |            0.57 |            1.18
gesummv          |   0.64 |      0.24 |      0.24 |                |                |                 |
floyd-warshall   |  72.31 |     13.70 |     13.70 |          14.05 |          14.05 |            0.98 |            0.98
ludcmp           |  31.62 |     20.94 |     20.94 |          22.31 |          22.26 |            0.94 |            0.94
covariance       |  71.80 |     64.70 |     64.70 |          64.78 |          64.88 |            1.00 |            1.00
jacobi-2d-imper  |   1.23 |      0.48 |      0.48 |           0.49 |           0.49 |            0.97 |            0.97
atax             |   0.81 |      0.19 |      0.19 |           0.12 |                |            1.57 |
fdtd-2d          |   5.26 |      2.07 |      2.07 |           3.63 |           2.43 |            0.57 |            0.85
trisolv          |   0.20 |      0.05 |      0.05 |                |                |                 |
2mm              | 206.63 |    155.09 |    155.09 |          48.75 |          39.92 |            3.18 |            3.88
mvt              |   1.69 |      1.18 |      1.18 |           0.23 |           0.29 |            5.13 |            4.06
gemver           |   2.33 |      1.30 |      1.30 |           0.55 |           0.57 |            2.36 |            2.28
reg_detect       |   0.43 |      0.06 |      0.06 |           0.80 |           0.72 |            0.08 |            0.09
correlation      |  71.81 |     64.73 |     64.73 |          64.80 |          64.75 |            1.00 |            1.00
gramschmidt      | 159.34 |    164.49 |    164.49 |                |                |                 |
dynprog          | 167.63 |     65.87 |     65.87 |          69.83 |          65.07 |            0.94 |            1.01
durbin           |   2.23 |      2.07 |      2.07 |           1.96 |                |            1.06 |

Figure 5: Runtime results on the PolyBench suite, comparing clang with different optimization levels and SPolly. Empty cells represent runs which could not be completed due to technical issues with the used frameworks.
