
Library Generation for Linear Transforms

Yevgen Voronenko
May 2008

Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in Electrical and Computer Engineering

Carnegie Institute of Technology
Carnegie Mellon University


Acknowledgements

The roots of this thesis lie in the grand vision of my advisor Markus Puschel. Back in 2000, after listening to one of his talks, I thought it was a neat idea to generate a full library from some very high-level rules that define the algorithm. Yet it was not at all clear that this would be possible.

I thank Markus for providing me with a never-ending source of inspiration and for countless weekends spent proofreading drafts of various papers and this thesis.

This work would not have been possible without major contributions made by Franz Franchetti, who developed a number of key mathematical insights on which I heavily built my work. Franz and I spent long hours arguing about how our rewriting systems should and should not work.

I would like to extend my special thanks to Peter Tang, Dmitry Baksheev, and Victor Pasko, whom I had a chance to work with at Intel. Peter provided me with great encouragement to work on new and exciting projects, just to “see what happens”. I enjoyed working with Dmitry and Victor, and really hope to be able to continue our fruitful collaboration. I bring my sincere apologies for letting you guys suffer through my e-mail lag.

Many of the interesting experimental results of this thesis are possible due to the work of Frederic de Mesmay, who helped me just because it was interesting, and Srinivas Chellappa, who maintained the machines used for benchmarking.

I would also like to thank all other members of the Spiral group for lots of interesting discussions and for suggesting many ideas that helped to improve this thesis.

I am very grateful to Jeremy Johnson, who helped me enormously throughout my undergraduate years at Drexel University, and who also got me interested in symbolic computation, which I had previously regarded as the most boring topic in existence.

Many of the enabling ideas of this work go back to my internship at MathStar in Minneapolis with Andrew Carter and Steven Hawkins, with whom I worked on a hypothetical hardware compiler. I am thankful to Andrew and Steve for making my work enjoyable and challenging.

Finally, I would like to thank my wife and children for their tremendous patience and support during my seemingly never-ending time at graduate school.


Abstract

The development of high-performance numeric libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. To make things worse, each library often has to be re-implemented and re-optimized whenever a new platform is released.

In this thesis we develop a library generator that completely automates the library development for one important numerical domain: linear transforms, which include the discrete Fourier transform, discrete cosine transforms, filters, and discrete wavelet transforms. The input to our generator is a specification of the transform and a set of recursive algorithms for the transform, represented in a high-level domain-specific language; the output is a C++ library that supports general input size, is vectorized and multithreaded, and provides an optional adaptation mechanism for the memory hierarchy. Further, as we show in extensive benchmarks, the runtime performance of our automatically generated libraries is comparable to, and often even higher than, the best existing human-written code, including the widely used library FFTW and the commercially developed and maintained Intel Integrated Performance Primitives (IPP) and AMD Performance Library (APL).

Our generator automates all library development steps typically performed manually by programmers, such as analyzing the algorithm and finding the set of required recursive functions and base cases, the appropriate restructuring of the algorithm to parallelize for multiple threads and to vectorize for the available vector instruction set and vector length, and performing code level optimizations such as algebraic simplification and others. The key to achieving full automation as well as excellent performance is a proper set of abstraction layers in the form of domain-specific languages called SPL (Signal Processing Language), index-free ∑-SPL, regular ∑-SPL, and intermediate code representation, and the use of rewriting systems to perform all difficult optimizations at a suitable, high level of abstraction.

In addition, we demonstrate that our automatic library generation framework enables various forms of customization that would be very costly to perform manually. As examples, we show generated trade-offs between code size and performance, generated Java libraries obtained by modifying the backend, and functional customization for important transform variants.


Contents

1 Introduction
  1.1 Platform Evolution and the Difficulty of Library Development
  1.2 Goal of the Thesis
  1.3 Related Work
  1.4 Contribution of the Thesis
  1.5 Organization of the Thesis

2 Background
  2.1 Transforms
  2.2 Fast Transform Algorithms: SPL
  2.3 Spiral
  2.4 ∑-SPL and Loop Merging
    2.4.1 Overview
    2.4.2 Motivation for the General Method
    2.4.3 Loop Merging Example
    2.4.4 ∑-SPL: Definition
    2.4.5 The ∑-SPL Rewriting System
    2.4.6 The ∑-SPL Rewriting System: Rader and Prime-Factor Example
    2.4.7 Final Remarks

3 Library Generation: Library Structure
  3.1 Overview
  3.2 Motivation and Problem Statement
  3.3 Recursion Step Closure
    3.3.1 Example
    3.3.2 Overview of the General Algorithm
    3.3.3 Parametrization
    3.3.4 Descend
    3.3.5 Computing the Closure
    3.3.6 Handling Multiple Breakdown Rules
    3.3.7 Termination
    3.3.8 Unification: From ∑-SPL Implementations to Function Calls
  3.4 Generating Base Cases
  3.5 Representing Recursion: Descent Trees
  3.6 Enabling Looped Recursion Steps: Index-Free ∑-SPL
    3.6.1 Motivation
    3.6.2 Index-Free ∑-SPL: Basic Idea
    3.6.3 Ranked Functions
    3.6.4 Ranked Function Operators
    3.6.5 General λ-Lifting
    3.6.6 Library Generation with Index-Free ∑-SPL
  3.7 Advanced Loop Transformations
    3.7.1 GT: The Loop Non-Terminal
    3.7.2 Loop Interchange
    3.7.3 Loop Distribution
    3.7.4 Strip-Mining
  3.8 Inplaceness
  3.9 Examples of Complicated Closures

4 Library Generation: Parallelism
  4.1 Vectorization
    4.1.1 Background
    4.1.2 Vectorization by Rewriting SPL Formulas
    4.1.3 Vectorization by Rewriting ∑-SPL Formulas
    4.1.4 Vectorized Closure Example
  4.2 Parallelization
    4.2.1 Background
    4.2.2 Parallelization by Rewriting SPL Formulas
    4.2.3 Parallelization by Rewriting ∑-SPL Formulas
    4.2.4 Parallelized Closure Example

5 Library Generation: Library Implementation
  5.1 Overview
  5.2 Recursion Step Semantics as Higher-Order Functions
  5.3 Library Plan
  5.4 Hot/Cold Partitioning
  5.5 Code Generation

6 Experimental Results
  6.1 Overview and Setup
  6.2 Performance of Generated Libraries, Overview
    6.2.1 Transform Variety and the Common Case Usage Scenario
    6.2.2 Non 2-power Sizes
    6.2.3 Different Precisions
    6.2.4 Scalar Code
  6.3 Higher-Dimensional Transforms
  6.4 Efficiency of Vectorization and Parallelization
  6.5 Flexibility
    6.5.1 Functional Library Customization
    6.5.2 Qualitative Library Customization
    6.5.3 Backend Customization: Example, Java
    6.5.4 Other Kinds of Customization
  6.6 Detailed Performance Evaluation

7 Future Work and Conclusions
  7.1 Major Contributions
  7.2 Current Limitations
  7.3 Future Work

A Detailed Performance Evaluation
  A.1 Intel Xeon 5160
  A.2 AMD Opteron 2220
  A.3 Intel Core 2 Extreme QX9650


List of Figures

1.1  Evolution of Intel platforms.
1.2  Numerical Recipes versus the optimized implementation of the DFT.
1.3  Library generator: input and output. The generated high performance libraries consist of three main components: recursive functions, base cases, and additional infrastructure.
1.4  Performance of automatically generated DFT and 2D DCT-2 libraries. Platform: dual-core Intel Xeon 5160, 3 GHz.
2.1  Adaptive program generator Spiral.
2.2  Simple loop merging example.
2.3  Loop merging and code generation for SPL formulas. (*) marks the transform/algorithm specific pass.
3.1  Library generation in Spiral.
3.2  Possible call tree for computing DFT1024 in FFTW 2.x using Implementation 7 (function names do not correspond to actual FFTW functions).
3.3  Library generation: “Library Structure”. Input: transforms and breakdown rules. Output: the recursion step closure (if it exists) and ∑-SPL implementations of each recursion step.
3.4  Graphical representation of the recursion step closure obtained from the Cooley-Tukey FFT (3.1). The closure in (b) corresponds to (3.15).
3.5  Descent tree for DFT32.
3.6  Descent tree for DFT32.
3.7  Descent trees corresponding to different downranking orderings of the GT non-terminal associated with Im ⊗ A ⊗ Ik.
3.8  Call graphs for the generated libraries with looped recursion steps (corresponding to Table 3.11).
4.1  The space of implementations including both vector and thread parallelism.
4.2  Descent tree for Vec2(DFT32).
4.3  Multicore Cooley-Tukey FFT for p processors and cache line length µ.
4.4  Multithreaded C99 OpenMP function computing y = DFT8 x using 2 processors, called by a sequential program.
4.5  GT iteration space partition implied by GT parallelization rules (4.30)–(4.33).
5.1  Library generation: “Library Implementation”. Input: recursion step closure and ∑-SPL implementations. Output: library implementation in target language.
5.2  Parameter flow graph obtained from the library plan in Table 5.1 and various stages of the automatic hot/cold partitioning algorithm. Node color and shape denote state: white square = “none”, black square = “cold”, gray square = “hot”, gray octagon = “reinit”.
6.1  Generated libraries for complex and real Fourier transforms (DFT and RDFT). These are the most widely used transforms, and both vendor libraries and FFTW are highly optimized.
6.2  Generated libraries for discrete cosine and discrete Hartley transforms (DCT-2 and DHT). Less effort is spent on hand-optimizing these transforms, and thus hand-written libraries are either slower or lack the functionality. Neither IPP nor APL provide the DCT-4, and in addition APL does not implement DCT-2.
6.3  Generated libraries for discrete Hartley transform and Walsh-Hadamard transform (DHT and WHT). Less effort is spent on optimizing these transforms, and thus hand-written libraries are much slower. IPP and APL do not provide DHT and WHT at all.
6.4  Generated libraries for FIR filters; the first row is a plain filter, the second row is a filter downsampled by 2, which is the building block for a 2-channel wavelet transform.
6.5  Generated DFT library for sizes N = 2^i 3^j 5^k. No threading is used.
6.6  Comparison of generated libraries for DFT and DCT-2 in double (2-way vectorized) and single (4-way vectorized) precision. No threading.
6.7  Generated scalar DFT and RDFT libraries versus scalar FFTW. No threading.
6.8  Generated 2-dimensional DFT and DCT-2 libraries, up to 2 threads. IPP does not provide double precision 2-dimensional transforms.
6.9  Vectorization efficiency. Platform: Intel Xeon.
6.10 Parallelization efficiency. 4 threads give no speedup over 2 threads, due to low memory bandwidth. Platform: Intel Xeon 5160.
6.11 Better parallelization efficiency due to better memory bandwidth. Platform: Intel Core 2 Extreme QX9650.
6.12 Generated RDFT libraries with built-in scaling. Generated libraries incur no performance penalty.
6.13 Performance vs. code size tradeoff in a generated DFT library. KLOC = kilo (thousands of) lines of code. Generated libraries are for the DFT and 2-power sizes only. FFTW includes other transforms and supports all transform sizes.
6.14 Generated Java libraries performance: trigonometric transforms.
A.1  Generated libraries performance: trigonometric transforms, no vectorization (double precision), no threading. Platform: Intel Xeon 5160.
A.2  Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), no threading. Platform: Intel Xeon 5160.
A.3  Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), no threading. Platform: Intel Xeon 5160.
A.4  Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 2 threads. Platform: Intel Xeon 5160.
A.5  Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 2 threads. Platform: Intel Xeon 5160.
A.6  Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 2 threads. Platform: Intel Xeon 5160.
A.7  Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 4 threads. Platform: Intel Xeon 5160.
A.8  Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 4 threads. Platform: Intel Xeon 5160.
A.9  Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 4 threads. Platform: Intel Xeon 5160.
A.10 Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 1 thread(s). Platform: Intel Xeon 5160.
A.11 Generated libraries performance: FIR filter, varying length, up to 1 thread(s). Platform: Intel Xeon 5160.
A.12 Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 2 thread(s). Platform: Intel Xeon 5160.
A.13 Generated libraries performance: FIR filter, varying length, up to 2 thread(s). Platform: Intel Xeon 5160.
A.14 Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 4 thread(s). Platform: Intel Xeon 5160.
A.15 Generated libraries performance: FIR filter, varying length, up to 4 thread(s). Platform: Intel Xeon 5160.
A.16 Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 1 thread(s). Platform: Intel Xeon 5160.
A.17 Generated libraries performance: FIR filter, varying length, up to 1 thread(s). Platform: Intel Xeon 5160.
A.18 Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 2 thread(s). Platform: Intel Xeon 5160.
A.19 Generated libraries performance: FIR filter, varying length, up to 2 thread(s). Platform: Intel Xeon 5160.
A.20 Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 4 thread(s). Platform: Intel Xeon 5160.
A.21 Generated libraries performance: FIR filter, varying length, up to 4 thread(s). Platform: Intel Xeon 5160.
A.22 Generated libraries performance: trigonometric transforms, no vectorization (double precision), no threading. Platform: AMD Opteron 2220.
A.23 Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), no threading. Platform: AMD Opteron 2220.
A.24 Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), no threading. Platform: AMD Opteron 2220.
A.25 Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 2 threads. Platform: AMD Opteron 2220.
A.26 Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 2 threads. Platform: AMD Opteron 2220.
A.27 Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 2 threads. Platform: AMD Opteron 2220.
A.28 Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 4 threads. Platform: AMD Opteron 2220.
A.29 Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 4 threads. Platform: AMD Opteron 2220.
A.30 Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 4 threads. Platform: AMD Opteron 2220.
A.31 Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 1 thread(s). Platform: AMD Opteron 2220.
A.32 Generated libraries performance: FIR filter, varying length, up to 1 thread(s). Platform: AMD Opteron 2220.
A.33 Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 2 thread(s). Platform: AMD Opteron 2220.
A.34 Generated libraries performance: FIR filter, varying length, up to 2 thread(s). Platform: AMD Opteron 2220.
A.35 Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 4 thread(s). Platform: AMD Opteron 2220.
A.36 Generated libraries performance: FIR filter, varying length, up to 4 thread(s). Platform: AMD Opteron 2220.
A.37 Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 1 thread(s). Platform: AMD Opteron 2220.
A.38 Generated libraries performance: FIR filter, varying length, up to 1 thread(s). Platform: AMD Opteron 2220.
A.39 Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 2 thread(s). Platform: AMD Opteron 2220.
A.40 Generated libraries performance: FIR filter, varying length, up to 2 thread(s). Platform: AMD Opteron 2220.
A.41 Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 4 thread(s). Platform: AMD Opteron 2220.
A.42 Generated libraries performance: FIR filter, varying length, up to 4 thread(s). Platform: AMD Opteron 2220.
A.43 Generated libraries performance: trigonometric transforms, no vectorization (double precision), no threading. Platform: Intel Core 2 Extreme QX9650.
A.44 Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), no threading. Platform: Intel Core 2 Extreme QX9650.
A.45 Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), no threading. Platform: Intel Core 2 Extreme QX9650.
A.46 Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 2 threads. Platform: Intel Core 2 Extreme QX9650.
A.47 Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 2 threads. Platform: Intel Core 2 Extreme QX9650.
A.48 Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 2 threads. Platform: Intel Core 2 Extreme QX9650.
A.49 Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 4 threads. Platform: Intel Core 2 Extreme QX9650.
A.50 Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 4 threads. Platform: Intel Core 2 Extreme QX9650.
A.51 Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 4 threads. Platform: Intel Core 2 Extreme QX9650.


List of Tables

1.1  Generations of Spiral. The columns correspond to the algorithm optimizations and the rows represent different types of code. Each entry indicates for which transforms such code can be generated. The code types are illustrated in Table 1.2. “All” denotes all transforms in Table 2.1 shown in Chapter 2. “Most” excludes the discrete cosine/sine transforms.
1.2  Code types.
2.1  Linear transforms and their defining matrices.
2.2  Definition of the most important SPL constructs in Backus-Naur form; n, k are positive integers, α, a_i real numbers.
2.3  Examples of breakdown rules written in SPL. Above, Q, P, K, V, W are various permutation matrices, D are diagonal matrices, and B, C, E, G, M, N, S are other sparse matrices, whose precise form is irrelevant.
2.4  Translating SPL constructs to code: examples.
2.5  Translating ∑-SPL constructs to code.
2.6  Rules to expand the skeleton.
2.7  Loop merging rules.
2.8  Index function simplification rules.
3.1  Descending into the recursion step (3.10) using the Cooley-Tukey FFT (3.1). The left-hand side is the original recursion step, and the right-hand sides are as follows: “Descend” is the result of applying the breakdown rule and ∑-SPL rewriting, “RDescend” is the list of parametrized recursion steps needed for the implementation, and “RDescend*” is the same as “RDescend” but uses “*” to denote the parameter slot.
3.2  Comparison of descent trees and ruletrees.
3.3  Rank-k index mapping functions.
3.4  Rewrite rules for converting SPL to ranked ∑-SPL.
3.5  Rewrite rules for ranked functions.
3.6  Converting SPL to ranked ∑-SPL and GT.
3.7  Loop interchange example.
3.8  Loop distribution example.
3.9  Strip-mining example.
3.10 Generated recursion step closures for DFT, RDFT, and DCT-4 with looped recursion steps in index-free ∑-SPL.
3.11 Generated recursion step closures for DFT, RDFT, and DCT-4 (of Table 3.10) with looped recursion steps using GT.
4.1  SPL vectorization rules for the stride permutation.
4.2  SPL vectorization rules for tensor products. A is an n × n matrix.
4.3  Converting other constructs to GT.
4.4  Recursion step closure and call graph for the 2-way vectorized DCT-4.
4.5  SPL shared memory parallelization rules. P is any permutation.
4.6  ∑-SPL shared memory parallelization rules. T = GT(A, f, g, n), S = GTI(A, f, n).
4.7  Recursion step closure and call graph for the 2-thread parallelized DFT. Compare to the DFT closure in Table 3.11 and the call graph in Fig. 3.8(a).
5.1  Library plan constructed from the recursion step closure in (5.1).
5.2  Fragment of the automatically generated code for DFTu1 (with our comments added), based on the recursion step closure in Fig. 3.4 and the library plan in Table 5.1. Func 1(...) creates a generating function for the diagonal elements, and is passed down to the child recursion step (RS 3), which will use it to generate the constants. Since each iteration of the loop uses a different portion of the constants, several copies of Env 3 are created. Such copying is always required if a child recursion step has “reinit” parameters (u7 in RS 3 in this case).
6.1  Breakdown rules for a variety of transforms: DFT for real input (RDFT), discrete Hartley transform (DHT), discrete cosine transforms (DCTs) of types 2–4. The others are auxiliary transforms needed to compute them. P, Q are permutation matrices, T are diagonal matrices, B, C, D, N are other sparse matrices. The first and second rule are respectively for four and two transforms simultaneously.
6.2  Number of recursion steps m/n in our generated libraries. m is the number of steps with loops; n is the number without loops and a close approximation of the number of base cases (codelets) needed for each small input size.
6.3  Code size of our generated libraries. KLOC = kilo (thousands of) lines of code. WHT, scaled, and 2D transforms are not shown in all possible variants.
6.4  Number of recursion steps in the JTransforms Java library.


Chapter 1

Introduction

Many compute-intensive applications process large data sets or operate under real-time constraints, which creates the need for very fast software implementations. Important examples can be grouped into scientific computing applications such as weather simulations, computer-aided drug discovery, or oil exploration; consumer and multimedia applications such as image and video compression, medical imaging, or speech recognition; and embedded applications such as wireless communication, signal processing, and control.

In many of these applications the bulk of the computation is performed by well-defined mathematical functionality, and application developers rely on high-performance numerical libraries that provide this functionality. If these libraries are fast, then the applications that use them will run fast too. As computing platforms change, ideally only the library has to be updated to port the application and take advantage of new platform features. Commonly used numeric library domains include dense and sparse linear algebra, linear transforms, optimization, coding, and others.

However, as we will explain, the development of libraries that achieve the best possible performance has become very difficult and costly due to the evolution of computer platforms. This thesis aims at completely automating library development and library optimization, without sacrificing performance, for one of the most commonly used domains: linear transforms.

The most prominent example of a linear transform is the discrete Fourier transform (DFT), which is arguably among the most important tools used across disciplines in science and engineering. Other important linear transforms include digital filters, discrete cosine and sine transforms, and wavelet transforms.

To better understand the problem of developing high performance libraries, and hence the motivation for this thesis, we start by analyzing the evolution of computer platforms.

1.1 Platform Evolution and the Difficulty of Library Development

Figure 1.1 shows the evolution of both clock frequency and floating-point peak performance (single and double precision) of Intel platforms over the past 15 years. The growth pattern identifies the key problems in the development of high performance libraries, as explained next.

The plot shows that early on the growth rates of CPU frequency and peak performance were coupled. However, in recent years CPU frequency has stalled (at around 3 GHz), while the peak performance continues to increase (even at a faster rate). This means that hardware vendors have turned to increasing on-chip parallelism to improve peak performance.


Figure 1.1: Evolution of Intel platforms. (Log-scale plot, 1993–2007: CPU frequency [MHz] and floating-point peak performance [Mflop/s], single and double precision, for Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, Core 2 Duo, and Core 2 Extreme; frequency scaling gave “free speedup”, exploiting parallelism is “work required”. Data: www.sandpile.org.)

On the other hand, clock frequency cannot keep growing due to physical limitations. This paradigm shift means that any further performance improvement in numerical libraries will largely depend on the effective exploitation of the available on-chip parallelism, which is usually difficult. While clock frequency scaling translated into “free” speedup for numeric libraries across time, exploiting the on-chip parallelism usually requires programming effort.

Modern processors have three types of on-chip parallelism: instruction-level parallelism, vector (SIMD) parallelism, and thread parallelism. Instruction-level parallelism enables the processor to execute multiple instructions concurrently by means of out-of-order execution. This is the only type of parallelism that requires no software development effort, even though it may require proper instruction scheduling for optimal efficiency.

Vector or SIMD (single instruction, multiple data) parallelism is realized by adding vector extensions to the instruction set that enable parallel processing of multiple data elements. The only ways to exploit SIMD parallelism are to use assembly code, to use special C/C++ intrinsics, or to rely on automatic compiler vectorization. The latter tends to be far suboptimal for most numerical problems.
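For illustration, the following minimal sketch shows what the intrinsics route looks like for a 4-way single-precision addition using SSE. The function name and the assumptions of 16-byte alignment and a length divisible by 4 are ours; real library code must also handle unaligned data and loop remainders.

    #include <xmmintrin.h>  // SSE intrinsics

    // Elementwise z = x + y, processing 4 floats per instruction.
    // Assumes x, y, z are 16-byte aligned and n is a multiple of 4.
    void vadd4(const float *x, const float *y, float *z, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(x + i);           // load 4 floats
            __m128 vy = _mm_load_ps(y + i);
            _mm_store_ps(z + i, _mm_add_ps(vx, vy));  // 4 additions at once
        }
    }

Even this trivial kernel hints at why compilers struggle with more complex code: the data layout must match the vector width, which for transforms requires restructuring the algorithm itself.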

Finally, thread parallelism is the coarse-grain parallelism provided by incorporating multiple processor cores into a single CPU. Exploiting this type of parallelism requires programs with multiple threads of control.

To complicate matters further, the exponential growth of peak performance has not been matched by the increase in memory bandwidth (not shown in Fig. 1.1), leading to the processor-memory bottleneck problem and the emergence of multi-level cache memory hierarchies, multiple memory banks, and prefetch and temporal load/store instructions, all designed to alleviate the problem. To achieve the best possible performance, a library hence also has to be optimized and “tuned” to the peculiarities of the memory hierarchy.

Figure 1.2: Numerical Recipes versus the optimized implementation of the DFT. (Plot: performance [Gflop/s] versus input size, complex DFT, single precision, on a dual-core Intel Xeon 5160, 3 GHz; the generated library (this thesis) reaches up to 20x the performance of Numerical Recipes.)

As a result, library development is very time consuming and expensive, and requires expert domain knowledge, expert platform knowledge, and expert programming skills, which means very few people are trained to do it. Further, performance is in general not portable, which means that libraries have to be reoptimized or even reimplemented for every new platform.

To better illustrate these problems, consider the plot in Fig. 1.2, which compares the reference implementation of the discrete Fourier transform (DFT) from “Numerical Recipes” [94], a standard book on numerical programming, with a highly platform-optimized implementation obtained automatically using the methods in this thesis. The platform is the latest dual-core workstation with an Intel Xeon 5160 processor. The performance difference is up to a surprising factor of 20, even though both implementations have roughly the same operations count. To obtain the best performance, the optimized library is 4-way vectorized using platform-specific SIMD instructions, uses up to 2 threads, and is optimized for the memory hierarchy.

The “Numerical Recipes” implementation was compiled using the latest (at the time of writing) Intel C Compiler 10.1. Although the compiler provides automatic vectorization and parallelization, it could not parallelize or vectorize any parts of this program. This is typical for non-trivial functionality like the DFT, because these optimizations require domain knowledge in the form of sophisticated high-level algorithm manipulations, which are not feasible with a standard compiler.

Figure 1.3: Library generator: input and output. The generated high performance libraries consist of three main components: recursive functions (e.g., dft(n, Y, X, ...) calling child1->compute(Y, X) and child2->compute(Y, Y)), base cases (e.g., dft2, dft3, dft5), and additional target infrastructure (e.g., init(n) performing allocation and precomputation).

1.2 Goal of the Thesis

The goal of this thesis is the computer generation of high performance libraries for the entire domain of linear transforms, given only a high-level algorithm specification, a request for multithreading, and the SIMD vector length.

In other words, we want to achieve complete automation in library development for an entire domain of structurally complex numerical algorithms. This way, the cost of library development and maintenance is dramatically reduced, and very fast porting to new platforms becomes possible. Further, library generation provides various other benefits. For example, it expands the set of functionality for which high performance is readily available, and it enables library customization.

We generate transform libraries given only a high-level specification (in a domain-specific language) of the recursive algorithms that the library should use. For example, a typical input to our library generator is

    Transform:       DFT_n,
    Algorithms:      DFT_km → (DFT_k ⊗ I_m) diag(Ω_{k,m}) (I_k ⊗ DFT_m) L^{km}_k,
                     DFT_2 → [ 1   1 ]
                             [ 1  −1 ],
    Vectorization:   2-way SSE,
    Multithreading:  yes.

The output is a generated library that is

• for general input size;

• vectorized using the available SIMD vector instruction set;

• multithreaded with a fixed or variable number of threads;

• (optionally) equipped with a feedback-driven platform adaptation mechanism to select the best recursion strategy at runtime, and thus to automatically adapt to the underlying memory hierarchy and other platform-specific details;

• performance-competitive with the best existing hand-written libraries.

Figure 1.4: Performance of automatically generated DFT and 2D DCT-2 libraries. Platform: dual-core Intel Xeon 5160, 3 GHz. (a) DFT library (15,000 lines of code, 0.75 MB): complex DFT, double precision (2-way vectorization), up to 2 threads. (b) 2D DCT-2 library (27,000 lines of code, 2.1 MB): 2D DCT-2, single precision (4-way vectorization), up to 2 threads. (Plots: performance [Gflop/s] of the generated library, Intel IPP 5.2, and FFTW 3.2alpha2 over input sizes 4 to 64k and 8x8 to 512x512, respectively.)

The last point is particularly important: it means that even though we completely automate library development, there is no performance loss compared to an expertly hand-written and hand-tuned library. In fact, as we will show, our generated libraries often outperform the best existing hand-written code.

Fig. 1.3 illustrates the concept of a library generator. The generated libraries consist of three main components whose exact form is automatically derived and implemented by the generator: a) a set of recursive functions needed to compute the transform; b) a set of optimized base case functions that implement small size transforms and terminate the recursion; c) an infrastructure responsible for initialization, such as precomputations, and adaptive search. Not shown in Fig. 1.3 is the efficient use of vector instructions and threading, also obtained automatically.
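To suggest the shape of these components, here is a hypothetical C++ sketch loosely mirroring Fig. 1.3; all names (RS_dft, compute, dft2_base) and the interface are our illustrative assumptions, not the actual generated code.

    #include <complex>
    typedef std::complex<double> cpx;

    // (a) Recursive function: one recursion step. The infrastructure selects
    //     the children and precomputes constants (e.g., twiddle factors).
    struct RS_dft {
        RS_dft *child1, *child2;       // smaller recursion steps
        void compute(cpx *Y, const cpx *X) {
            child1->compute(Y, X);     // first factor of the algorithm
            child2->compute(Y, Y);     // second factor, applied in place
        }
    };

    // (b) Base case: fully unrolled small transform terminating the recursion.
    void dft2_base(cpx *Y, const cpx *X) {
        Y[0] = X[0] + X[1];            // the DFT_2 butterfly
        Y[1] = X[0] - X[1];
    }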

The structure of the generated libraries is roughly modeled after FFTW [61], a widely used DFT library. We discuss FFTW later together with other related work.

The target platforms we consider are modern workstations, such as systems based on Intel Core 2 Duo and AMD Opteron processors. These workstations have deep memory hierarchies, SIMD vector extensions, and shared memory multicore processors.

As an example result, we show in Figure 1.4 the performance of two automatically generated libraries, for the DFT and for the 2-dimensional DCT-2 (discrete cosine transform of type 2). We compare the performance of our generated libraries to the best available (hand-written) libraries, FFTW [61] and Intel IPP, on an Intel Xeon 5160 (server variant of Core 2 Duo) processor machine. We show performance in pseudo Gflop/s, which is standard practice in the domain of transforms. Pseudo Gflop/s are computed as

    pseudo Gflop/s = (normalized arithmetic cost) / (runtime [sec] · 10^9),


assuming a normalized arithmetic cost (number of additions and multiplications) of 5n log(n) for the DFT of size n, and 2.5mn log(mn) for the m × n 2-dimensional DCT-2. Because the arithmetic cost is normalized, pseudo Gflop/s are proportional to inverse runtime.
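As a concrete reference, the metric can be computed as follows (our own helper functions, not part of the generated libraries; we assume base-2 logarithms, the usual convention for transform arithmetic costs):

    #include <cmath>

    // Pseudo Gflop/s for a complex DFT of size n, normalized cost 5 n log2(n).
    double pseudo_gflops_dft(int n, double runtime_sec) {
        return (5.0 * n * std::log2((double) n)) / (runtime_sec * 1e9);
    }

    // Pseudo Gflop/s for an m x n 2D DCT-2, normalized cost 2.5 m n log2(m n).
    double pseudo_gflops_dct2_2d(int m, int n, double runtime_sec) {
        return (2.5 * m * n * std::log2((double) (m * n))) / (runtime_sec * 1e9);
    }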

In the case of the DFT, where considerable development effort is spent, the performance is comparable, even though our generated library is for most sizes somewhat faster than both competitors. IPP mainly suffers from the lack of threading. For the DCT-2, however, where less development effort is invested, the performance of the generated library is on average a factor of 4 faster than FFTW and IPP. In FFTW (and most likely IPP) the DCT is implemented using a conversion to a real-data DFT. The primary reason is to reuse code and to save on implementation effort. This indirect method results in performance degradation. In contrast, our generated library is a “native” DCT library that offers the same performance as the DFT.

Note the large size of our generated libraries (15,000 and 27,000 lines of code, respectively). This is typical for high-performance libraries, since the necessary optimizations include loop and recursion unrolling and also require the introduction and implementation of various versions of functions used for efficient recursive computation. This need for large code size is another argument for automatic library generation.

1.3 Related Work

The high development costs for numerical libraries, the need to reoptimize libraries for every new platform, and in some cases the lack of desired functionality have spawned academic research efforts on automating the library development and optimization process. Most of the automation work is focused on the performance-critical domains of linear algebra and linear transforms, but there are also research efforts in other domains.

Domain: Linear transforms. In the domain of linear transforms, the DFT libraries FFTW [61] and UHFFT [84] partially automate the development process by employing a limited form of code generation and automatic runtime selection of the best DFT algorithm.

Specifically, FFTW and UHFFT use a special “codelet generator” [5, 59] to generate fully unrolled code for small fixed size transform functions, called “codelets”. These codelets serve as building blocks for larger transforms. There are many different types of codelets, and the codelet generators are used to automatically generate a cross-product of the required codelet types and desired transform sizes.
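To give a flavor of codelet-style code, here is our own unrolled sketch of a size-4 DFT following the standard radix-2 factorization; actual FFTW codelets differ in interface, naming, and instruction scheduling.

    #include <complex>
    typedef std::complex<double> cpx;

    // Fully unrolled forward DFT of size 4:
    // two size-2 sub-DFTs followed by a combine stage.
    void dft4(cpx *Y, const cpx *X) {
        cpx t0 = X[0] + X[2], t1 = X[0] - X[2];   // DFT_2 on even-index inputs
        cpx t2 = X[1] + X[3], t3 = X[1] - X[3];   // DFT_2 on odd-index inputs
        cpx t3i = cpx(t3.imag(), -t3.real());     // multiply t3 by -i (twiddle)
        Y[0] = t0 + t2;
        Y[1] = t1 + t3i;
        Y[2] = t0 - t2;
        Y[3] = t1 - t3i;
    }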

The second form of automation is the runtime selection of the best algorithm. Both libraries provide hand-written recursions for the transform computations, and the base cases of these recursions are the codelets. There are multiple alternative recursions, and at runtime the fastest alternative is chosen using a heuristic feedback-driven search. Since the best choice is platform dependent, this search provides a form of automatic platform tuning.

Compared to our generated libraries in Fig. 1.3, FFTW and UHFFT automate the base case development but, beyond that, are still hand-written libraries based on a sophisticated design. In particular, the design includes the vectorization, parallelization, and recursion structure needed to achieve high performance. As a consequence, extending FFTW or developing a new FFTW-like library is a major undertaking. For example, consider the manual steps necessary to add a new DFT algorithm to one of these libraries. First, the general-size algorithm has to be hand-developed to obtain a new transform recursion. The recursion will typically decompose the original DFT “problem” into smaller problems, which could be other transforms or other DFT types. Next, the code has to be analyzed to determine whether new problem types (i.e., transform variants) are required, which is usually the case; if so, then either new recursions must be implemented for the new problem types, or existing recursions must be modified to support them. Finally, the code is analyzed to determine the set of required codelet types, and if new codelet types are required, the codelet generator is modified to be able to generate them.

In addition to the above steps, considerable effort is required for efficient vectorization and parallelization. The parallelization will most likely affect only the top-level recursion. However, vectorization may require new problem types, and thus a waterfall of changes, including new codelet types and modifications to the codelet generator.

Inspired by the ingenious design of FFTW, our goal is to eliminate all manual work and generate complete libraries like FFTW for a large set of linear transforms.

Another effort in the domain of linear transforms is the program generator Spiral [99], which served as the foundation of our work. We will discuss Spiral later.

Domain: Dense linear algebra. Virtually all dense linear algebra functionality depends on the efficient implementation of basic matrix operations called BLAS (basic linear algebra subroutines). The most prominent dense linear algebra library is the widely used LAPACK [8]. The BLAS are typically implemented and reoptimized by the hardware vendors for each new platform. For example, high-performance platform-optimized BLAS are provided by the Intel MKL and AMD ACML libraries.

ATLAS [130, 131] (Automatically Tuned Linear Algebra Subroutines) is an automatically tuned BLAS library, which uses a code generator together with a parameter search mechanism to automatically adapt to the underlying platform. Most of the critical BLAS functionality reduces to several flavors of the matrix-matrix multiply (MMM) operation, and thus the focus of ATLAS is to provide a very high-performance MMM implementation. ATLAS uses a code generator to generate several variants of the performance-critical inner kernel of the MMM, the so-called micro-MMM. In addition, ATLAS uses feedback-driven search to find the best values of the implementation parameters, for example, the matrix block size and the best choice of the inner kernel.
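For intuition, the computation performed by such a micro-MMM kernel is sketched below; the fixed block size NB = 4 and the C += A·B update convention are our illustrative assumptions, and generated ATLAS kernels are fully unrolled and carefully scheduled rather than written as plain loops.

    enum { NB = 4 };  // illustrative; ATLAS searches for the best block size

    // Micro-MMM: C += A * B on NB x NB row-major blocks.
    void micro_mmm(const double A[NB][NB], const double B[NB][NB],
                   double C[NB][NB]) {
        for (int i = 0; i < NB; i++)
            for (int k = 0; k < NB; k++)
                for (int j = 0; j < NB; j++)  // j innermost: stride-1 access to B, C
                    C[i][j] += A[i][k] * B[k][j];
    }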

ATLAS provides only partial automation. The (hand-written) ATLAS infrastructure provides threading, but not vectorization, which would have to be done as an optimization in the micro-MMM kernel. The ATLAS code generator, however, does not generate vectorized micro-MMM kernels. If vectorization is desired, as is the case on practically all modern processors, then the micro-MMM kernel has to be written again by hand.

The FLAME [23] (Formal Linear Algebra Method Environment) project enables the semi-automatic derivation of dense linear algebra matrix algorithms (such as the ones in LAPACK) from the concise specification of a loop invariant and the so-called partitioned matrix expression. The generated algorithms can be translated into code using the FLAME APIs [24], and even into parallel code [138] by combining the FLAME APIs and task queue parallelism. The code relies on the availability of fast, vectorized BLAS.

Domain: Sparse linear algebra. Most sparse linear algebra functionality is built around the sparse matrix-vector product (SpMV), which hence becomes the performance-critical function in this domain. The goal of the OSKI (Optimized Sparse Kernel Interface, earlier called Sparsity) project [44, 68] is to automate the optimization and tuning of the SpMV. The overall design is similar to ATLAS: the library uses a set of generated kernels which perform small dense matrix-vector products of different block sizes, and at runtime the library chooses the best algorithm and the best kernel size, employing a combination of a heuristic model and a runtime benchmark.
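For reference, the scalar SpMV in the common compressed sparse row (CSR) format is shown below; OSKI's tuned kernels instead replace the inner loop with small dense r × c block operations chosen at runtime. The function and variable names are ours.

    // y = A * x, with A stored in CSR format: the nonzeros of row i occupy
    // val[rowptr[i] .. rowptr[i+1]-1], with column indices in colind[].
    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }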


Unlike in ATLAS, in the sparse case the best choices depend on the sparsity pattern of the matrix, and thus these decisions must be made dynamically at runtime, rather than at code generation time as in ATLAS.

Other domains. The tensor contraction engine [15, 16] generates code for tensor contractions from a specification in a domain-specific language. The tool performs code reorganization in order to jointly optimize for minimal arithmetic cost and minimal temporary storage.

Adaptive sorting libraries [22, 79, 80] include several sorting algorithms and use a runtime selection mechanism to determine the best algorithm from the statistical properties of the input data.

Generative programming and domain-specific languages. Independent of the above work, the idea of program generation (programs that write other programs), also called generative programming, has recently gained considerable interest in the programming language and, to some extent, the software engineering community [1, 13, 14, 39, 111]. The basic goal is to reduce the effort of developing, maintaining, and analyzing software. Among the key tools for achieving these goals are domain-specific languages, which raise the level of abstraction for specific problem domains and hence enable the more compact representation and manipulation of programs [19, 40, 65, 67]. However, this community has to date considered neither numerical problems nor performance optimization as a goal.

In this thesis we show that domain-specific languages and generative programming techniques can be used for numerical problems and, more importantly, to achieve very high performance. The key is to design the domain-specific language based on the mathematics of the domain; this way, sophisticated optimizations such as parallelization and vectorization can be done at a high level of abstraction using rewriting systems.

Platform Vendor Libraries. A major part of numerical library development is done by hardware platform vendors, who provide optimized high performance implementations of important numerical algorithms for many domains, including linear transforms. Examples include the Intel Math Kernel Library (MKL) and Intel Integrated Performance Primitives (IPP), the AMD Performance Library (APL) and AMD Core Math Library (ACML), IBM ESSL, and the Sun Performance Library.

For the library user these hardware vendor libraries make it possible to tap into the highperformance of available platforms, without delving into low-level implementation details. At thesame time they enable the porting to new platforms provided the libraries have been updated bythe vendor.

Developing and maintaining these libraries is becoming more and more expensive for the reasonswe discussed earlier. This makes automation an important goal with real world impact.

As part of the work on this thesis and as our joint effort with Intel [11] some of our generatedcode was included in the Intel MKL starting with version 9.0. A much larger set of our generatedcode will be included in Intel IPP, which, starting with version 6.0 in late 2008, will provide aspecial domain, which consists exclusively of our automatically generated libraries. The domain iscalled ippg, where the “g” stands for generated. The introduction of ippg may mark the first timethat the development of performance libraries is done by a computer rather than a human.


(a) The first generation of Spiral, reported in [99].

    Code type                          Scalar   Vectorized   Threaded   Threaded & vectorized
    Fixed size, straightline           all      DFT          n/a        n/a
    Fixed size, looped                 DFT      DFT          -          -
    Fixed size, looped with reuse      -        -            -          -
    General size library, recursive    -        -            -          -

(b) The second generation of Spiral.

    Code type                          Scalar   Vectorized   Threaded   Threaded & vectorized
    Fixed size, straightline           all      DFT          n/a        n/a
    Fixed size, looped                 most     DFT          DFT        DFT
    Fixed size, looped with reuse      most     DFT          DFT        DFT
    General size library, recursive    DFT      -            -          -

(c) The third generation of Spiral: the contribution of this thesis.

    Code type                          Scalar   Vectorized   Threaded   Threaded & vectorized
    Fixed size, straightline           all      all          n/a        n/a
    Fixed size, looped                 all      all          all        all
    Fixed size, looped with reuse      all      all          all        all
    General size library, recursive    all      all          all        all

Table 1.1: Generations of Spiral. The columns correspond to the algorithm optimizations and the rows represent different types of code. Each entry indicates for which transforms such code can be generated. The code types are illustrated in Table 1.2. “All” denotes all transforms in Table 2.1 shown in Chapter 2. “Most” excludes the discrete cosine/sine transforms.

1.4 Contribution of the Thesis

Our contribution, to the best of our knowledge, is the design and implementation of the first system that completely automates high performance library development for an entire numerical domain and for state-of-the-art workstation computers. The domain is linear transforms, which have structurally very complex algorithms, as we will see later. The qualifier “high performance” is important, since our generated libraries match, and often exceed, the performance of the best hand-written libraries. In other words, we achieve complete automation without sacrifices in performance.

Equally important as the abilities of our library generator are the methods used. Namely, the framework underlying our generator is based on domain-specific languages and rewriting systems, tools that to date have barely been used in the area of library development or performance optimization (except, to some extent, in the prior work on Spiral explained next).

To explain our contributions in greater detail, we first discuss prior work on Spiral that is relevant for this thesis.

Prior work. Unlike FFTW and UHFFT discussed earlier, Spiral is not a library, but a program generator, which can generate standalone programs for transforms. In contrast to other work on


(a) Fixed size, unrolled

    void dft_4(complex *Y, complex *X) {
        complex s, t, t2, t3;
        t  = (X[0] + X[2]);
        t2 = (X[0] - X[2]);
        t3 = (X[1] + X[3]);
        s  = __I__*(X[1] - X[3]);
        Y[0] = (t + t3);
        Y[2] = (t - t3);
        Y[1] = (t2 + s);
        Y[3] = (t2 - s);
    }

(b) Fixed size, looped

    void dft_4(complex *Y, complex *X) {
        complex T[4];
        for (int i = 0; i <= 1; i++) {
            T[2*i]   = D[2*i]*(X[i] + X[i+2]);
            T[2*i+1] = D[2*i+1]*(X[i] - X[i+2]);
        }
        for (int j = 0; j <= 1; j++) {
            Y[j]   = T[j] + T[j+2];
            Y[2+j] = T[j] - T[j+2];
        }
    }

(c) Fixed size, looped with reuse

    void dft_4(complex *Y, complex *X) {
        for (int i = 0; i <= 1; i++)
            dft_2_1(T, X, 2*i, D+2*i, i);
        for (int j = 0; j <= 1; j++)
            dft_2_2(Y, T, j, j);
    }

(d) General size library, recursive

    struct dft : public Env {
        dft(int n);   // constructor
        void compute(complex *Y, complex *X);
        int _rule, f, n;
        char *_dat;
        Env *child1, *child2;
    };

    void dft::compute(complex *Y, complex *X) {
        child2->compute(Y, X, n, f, n, f, f);
        child1->compute(Y, Y, n, f, n, n/f, n/f);
    }

Table 1.2: Code types.

automatic performance tuning, Spiral internally uses a domain-specific language called SPL (signal processing language). In SPL, divide-and-conquer algorithms for transforms are expressed as rules and stored in Spiral. For a user-specified transform and transform size, Spiral applies these rules to generate different algorithms, represented in SPL. An SPL compiler translates these algorithms into C code, which is further optimized. Based on the runtime of the generated code, a search engine explores different choices of algorithms to find the best match to the computing platform.

The first generation of Spiral is described in full detail in [99], based on the papers [50, 51, 100, 108, 110, 137]. We introduce all relevant details as background in Chapter 2.

Table 1.1(a) shows the abilities of this first generation of Spiral. The types of code referred to in the rows of this table are illustrated in Table 1.2. The main limitations of the first generation of Spiral were the inability to

• generate loop code for transforms other than the DFT based on the Cooley-Tukey FFT algorithm;


• generate vector code for transforms other than the DFT;

• generate multithreaded code;

• generate general input size libraries.

The reasons for these limitations were rooted in the underlying framework. First, it could not support generality across transforms, except for the generation of scalar, straightline code, similar to FFTW. Second, the optimizations for loop code and vector code were, in a sense, “hard-coded” using a mechanism that did not allow for generalization to a large class of transforms. Third, there was no support for multithreading. Fourth, and most importantly, Spiral's approach was limited to generating code only for a given transform of fixed input size, which greatly simplifies the problem, as all optimizations can be inlined and there is no control flow in the program. The automatic generation of a complete, general input size library like FFTW or others was out of reach.

Contribution of this thesis. Our contribution is the complete design and implementation of what we call the “third generation of Spiral,” with the abilities shown in Table 1.1(c). As an intermediate step, and as part of our thesis work, we first designed and implemented from scratch the second generation of Spiral with the abilities shown in Table 1.1(b). This second generation already included new concepts and ideas to overcome some of the limitations of the first generation, in particular a general framework for loop merging [52], a redesigned and more general vectorization framework [54], and preliminary multithreading [53].

Full generality across transforms, across the types of code, and, in particular, the framework to generate general input size libraries was achieved with the third generation of Spiral, which is the main subject of this thesis.

To achieve the abilities in Table 1.1(c), our library generator includes and integrates a number of novel ideas and techniques from the general areas of programming languages, symbolic computation, and compilers. We list the most important techniques developed in this thesis:

• We introduce a new mathematical domain-specific language, called Σ-SPL, to symbolically represent transform algorithms. Σ-SPL is an extension of SPL [137], which is at the core of the first generation of Spiral. Σ-SPL is of mathematical nature, but makes explicit the descriptions of loops, index mappings, parametrization, and recursive calls. We further develop extensions to Σ-SPL necessary for the generation of general-size libraries, explained below.

• We design and use rewriting systems for performing difficult optimizations such as parallelization, vectorization, and loop merging automatically. The rewriting operates on the Σ-SPL representation of algorithms, i.e., on a high abstraction level. This way these optimizations become feasible and overcome known compiler limitations.

• We design and use a rewriting framework to automatically derive the Σ-SPL recursion step closure, a term introduced in this thesis to describe the smallest set of mutually recursive functions (expressed in Σ-SPL) sufficient to compute a given set of transforms. This framework requires Σ-SPL extensions including the loop non-terminal, ranked functions, and index-free Σ-SPL, all introduced in this thesis.

• We developed a compiler that translates the Σ-SPL recursion step closure into actual code in an intermediate representation. The compiler further optimizes the code using several standard compiler techniques. We implemented two target language backends which translate this intermediate code representation into Java or C++ and generate the necessary initialization and glue code to obtain the desired libraries.¹

• Finally, we integrate all the above components into a completely automatic, “push-button” library generation system that we designed and implemented. The system requires additional software engineering techniques such as conditional term-rewriting and various object-oriented extensions to simplify programming and other tasks.

1.5 Organization of the Thesis

The rest of this thesis is organized as follows. In Chapter 2 we review background on transforms and their fast algorithms, explain the Spiral framework, and discuss Σ-SPL and the loop merging optimization, which will be an important component of library generation. In Chapter 3 we give an overview of the general size library generation process, explain the derivation of the library structure, which computes the recursion step closure using Σ-SPL, and also explain several restructuring loop optimizations which can be done at the Σ-SPL level. In Chapter 4 we describe the parallelism framework, which enables automatic parallelization and vectorization at the SPL and Σ-SPL levels. In Chapter 5 we describe the library generation backend, which compiles the recursion step closure into target language code; in particular, we explain the C++ target. In Chapter 6 we show experimental results, comparing the performance of a variety of generated libraries to state-of-the-art hand-written libraries. In Chapter 7 we discuss future work and conclude.

¹ Parts of this backend were developed jointly with Frederic de Mesmay.


Chapter 2

Background

In this chapter, we give background on linear transforms and their fast algorithms. We describe the declarative representation of fast algorithms, called SPL, and show how this representation can be translated into code. Next, we give an overview of Spiral, a program generator for linear transforms, which uses SPL to automatically generate code for linear transforms and underlies our work. Finally, we present Σ-SPL, a lower level extension of SPL, which solves the fundamental problem of loop merging in SPL and enables robust looped code generation by following the path SPL → Σ-SPL → code.

2.1 Transforms

Any linear signal transform can be defined as a matrix-vector product

    y = Mx,

where M is the fixed transform matrix, and x and y are the input and output vectors, respectively. M can be real or complex, and correspondingly x and y are real or complex. There is a large number of well-known linear transforms. We list the most important ones below along with relevant citations.

• Discrete Fourier transform (DFT) [121, 122];

• Walsh-Hadamard transform (WHT) [17, 18];

• Real discrete Fourier transform (RDFT, also known as the DFT of real data or real-valued DFT) [20, 98, 112, 125];

• Discrete Hartley transform (DHT) [31, 32, 98, 125, 126];

• Discrete cosine and sine transforms of types 1–4 (DCT-t and DST-t, with t ∈ {1, 2, 3, 4}) [33, 96, 97, 103];

• Modified discrete cosine transform (MDCT) and its inverse (IMDCT) [82];

• FIR filter [62, 93];

• Downsampled FIR filter (as part of the wavelet transform) [62, 113, 124].


Main transforms

    DFT_n   = [ω_n^{kℓ}]_{0≤k,ℓ<n},  ω_n = e^{−2πj/n}
    WHT_n   = [ WHT_{n/2}  WHT_{n/2} ; WHT_{n/2}  −WHT_{n/2} ]
    RDFT_n  = I″_n [ cos(2πkℓ/n) ; −sin(2πkℓ/n) ]
    DHT_n   = I″_n [ cas(2πkℓ/n) ; cms(2πkℓ/n) ]
    DCT-1_n = [ cos(kℓπ/(n−1)) ]
    DST-1_n = [ sin((k+1)(ℓ+1)π/(n+1)) ]
    DCT-2_n = [ cos(k(2ℓ+1)π/(2n)) ]
    DST-2_n = [ sin((k+1)(2ℓ+1)π/(2n)) ]
    DCT-3_n = [ cos((2k+1)ℓπ/(2n)) ]
    DST-3_n = [ sin((2k+1)(ℓ+1)π/(2n)) ]
    DCT-4_n = [ cos((2k+1)(2ℓ+1)π/(4n)) ]
    DST-4_n = [ sin((2k+1)(2ℓ+1)π/(4n)) ]
    MDCT_n  = [ cos((2k+1)(2ℓ+1+n)π/(4n)) ]
    IMDCT_n = [ cos((2ℓ+1)(2k+1+n)π/(4n)) ]
    Filt_n(t)    = banded (Toeplitz) matrix whose rows contain the filter taps t_0, ..., t_{k−1}, each row shifted right by one
    ↓2 Filt_n(t) = as Filt_n(t), but with each row shifted right by two (filtering followed by downsampling by 2)

Auxiliary transforms

    RDFT-2_n   = I″_n [ cos(πk(2ℓ+1)/n) ; −sin(πk(2ℓ+1)/n) ]
    DHT-2_n    = I″_n [ cas(πk(2ℓ+1)/n) ; cms(πk(2ℓ+1)/n) ]
    RDFT-3_n   = [ cos(π(2k+1)ℓ/n) ; −sin(π(2k+1)ℓ/n) ]
    DHT-3_n    = [ cas(π(2k+1)ℓ/n) ; cms(π(2k+1)ℓ/n) ]
    DCT-3_n(r) = [ cos(rkℓπ/n) ]
    DST-3_n(r) = [ sin(rk(ℓ+1)π/n) ]

Table 2.1: Linear transforms and their defining matrices. Above, cas x = cos x + sin x and cms x = cos x − sin x, and the stacked cos/sin (and cas/cms) rows are interleaved by I″_n.


In Table 2.1 we show the respective transform matrices. Besides the main transforms, Table 2.1 also shows some auxiliary transforms. These are rarely used in practical applications, but arise as building blocks of fast algorithms for the main transforms.

In each case the subscript n specifies the length of the input vector x, i.e., the number of columns of the matrix. In Table 2.1, all transforms are n×n matrices, except the IMDCT_n, which is 2n×n, and the MDCT_n, which is n×2n. Transforms are always bold-faced in this thesis.

The goal of our work is to enable library generation for these and other linear transforms. In Chapter 6 we show generated libraries for a large subset of the transforms in Table 2.1.

2.2 Fast Transform Algorithms: SPL

A generic matrix-vector product for an n×n matrix requires Θ(n²) arithmetic operations [34]. In contrast, for most transforms there exist fast algorithms that exploit the structure of the transform to reduce the complexity to O(n log n). Every fast algorithm can be viewed as a factorization of the dense transform matrix M into a product of structured sparse matrices. Namely, if M = M_k M_{k−1} ⋯ M_1, and the M_i are sparse, the product y = Mx can be computed as a sequence of matrix-vector multiplications:

    t_1 = M_1 x,
    t_2 = M_2 t_1,
    ...
    t_{k−1} = M_{k−1} t_{k−2},
    y = M_k t_{k−1}.
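To make the staged evaluation concrete, here is a minimal C++ sketch (with our own, purely illustrative types) that computes y = M_k ⋯ M_1 x by applying one factor at a time; with O(log n) sparse factors, each applicable in O(n) operations, this yields the O(n log n) total.

    #include <functional>
    #include <vector>

    // One stage computes t_out = M_i * t_in for a single sparse factor M_i.
    // Representing a stage as an opaque function is purely illustrative.
    using Stage = std::function<std::vector<double>(const std::vector<double>&)>;

    // y = M_k * ... * M_1 * x, evaluated as t1 = M1 x, t2 = M2 t1, ..., y = Mk t_{k-1}.
    std::vector<double> apply_factored(const std::vector<Stage>& stages,
                                       std::vector<double> x) {
        for (const Stage& M : stages)
            x = M(x);
        return x;
    }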

We will express such factorizations using the domain-specific language SPL (Signal Processing Language). SPL is derived from matrix algebra and was described in [99, 137]. It is an extension of the Kronecker product formalism introduced by Van Loan in [122] and by Johnson et al. in [70]. We call elements of SPL formulas.

SPL. SPL consists of

• predefined sparse and dense matrices (including transforms);

• arbitrary matrices of predefined structure;

• arbitrary unstructured matrices; and

• matrix operators.

Predefined matrices include the transforms, like DFT_n and DCT-2_n, and several other matrices that are used as building blocks. These include the n×n identity matrix I_n, with ones on the main diagonal and zeros elsewhere, and the reversed identity matrix J_n, with ones on the antidiagonal and zeros elsewhere (J_n reverses the order of the vector elements).


A 2×2 butterfly matrix and a rotation matrix are defined by

    F_2 = [ 1  1 ; 1  −1 ],    R_α = [ cos α  −sin α ; sin α  cos α ].

Another predefined matrix is the n×n stride permutation matrix L^n_m with n = mk, defined by the underlying permutation jk + i ↦ im + j for 0 ≤ i < k, 0 ≤ j < m. In words, L^n_m transposes an m×k matrix stored in row-major order. L^n_{n/2} is also called the perfect shuffle. For example,

    L^8_4 = [ 1 · · · · · · ·
              · · · · 1 · · ·
              · 1 · · · · · ·
              · · · · · 1 · ·
              · · 1 · · · · ·
              · · · · · · 1 ·
              · · · 1 · · · ·
              · · · · · · · 1 ].
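As a concrete illustration, the following minimal C++ sketch (function name is ours) applies L^n_m by directly implementing the mapping above; for n = 8, m = 4 it maps (0, 1, ..., 7) to (0, 4, 1, 5, 2, 6, 3, 7), matching the matrix L^8_4.

    #include <cstdio>
    #include <vector>

    // y = L^n_m * x for n = m*k: output position j*k+i receives input
    // position i*m+j (0 <= i < k, 0 <= j < m).
    std::vector<double> stride_perm(const std::vector<double>& x, int n, int m) {
        int k = n / m;
        std::vector<double> y(n);
        for (int i = 0; i < k; ++i)
            for (int j = 0; j < m; ++j)
                y[j * k + i] = x[i * m + j];
        return y;
    }

    int main() {
        std::vector<double> x = {0, 1, 2, 3, 4, 5, 6, 7};
        for (double v : stride_perm(x, 8, 4))
            std::printf("%g ", v);   // prints: 0 4 1 5 2 6 3 7
        std::printf("\n");
    }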

Arbitrary matrices of predefined structure include permutation and diagonal matrices. For example, diag(a_0, a_1, ..., a_{n−1}) is the n×n diagonal matrix with diagonal entries a_i.

Permutation matrices are defined by their underlying permutations i ↦ π(i). If π is a permutation on {0, ..., n−1}, then the corresponding permutation matrix has a 1 at the positions (i, π(i)) and is zero everywhere else.

Arbitrary unstructured matrices include matrices that cannot be decomposed into smaller building blocks using the matrix operators defined below. The matrix

    [ 1  √2 ; 0  1 ]

is an example of such a matrix.

The three major matrix operators in SPL are the matrix product A · B = AB, the matrix direct sum

    A ⊕ B = [ A  0 ; 0  B ],

and the tensor product

    A ⊗ B = [a_{kℓ} · B]_{k,ℓ},  A = [a_{kℓ}]_{k,ℓ}.

Two important special cases of the tensor product arise when A or B is the identity matrix.


⟨spl⟩ ::= ⟨generic⟩ | ⟨symbol⟩ | ⟨transform⟩
        | ⟨spl⟩ · ... · ⟨spl⟩                    (product)
        | ⟨spl⟩ ⊕ ... ⊕ ⟨spl⟩                    (direct sum)
        | ⟨spl⟩ ⊗ ... ⊗ ⟨spl⟩                    (tensor product)
        | I_n ⊗_k ⟨spl⟩ | I_n ⊗^k ⟨spl⟩          (overlapped tensor product)
        | ⟨spl⟩ with an overline                 (conversion to real)
        | ...

⟨generic⟩   ::= diag(a_0, ..., a_{n−1}) | ...

⟨symbol⟩    ::= I_n | J_n | L^n_k | R_α | F_2 | ...

⟨transform⟩ ::= DFT_n | RDFT_n | DCT-2_n | Filt_n(h(z)) | ...

Table 2.2: Definition of the most important SPL constructs in Backus-Naur form; n, k are positive integers, α and a_i are real numbers.

In particular,

    I_n ⊗ A = diag(A, ..., A) = A ⊕ ... ⊕ A  (n copies),

expresses the obvious block-diagonal stacking (i.e., a direct sum) of n copies of A. Computing y = (I_n ⊗ A)x, where A is m×m, means that x is divided into n equal subvectors of size m each, and each subvector is independently multiplied with A.

The second special case is

    A ⊗ I_n = [ a_{0,0} I_n  ...  a_{0,m−1} I_n ; ... ; a_{m−1,0} I_n  ...  a_{m−1,m−1} I_n ],  A = [a_{kℓ}]_{k,ℓ}.

The above can be viewed either as a “coarse-grain” version of A (i.e., A operating on subvectors of size n instead of scalars) or as a tight interleaving of n copies of A. Computing y = (A ⊗ I_n)x, where A is m×m, means that x is divided into m equal subvectors of size n, and A is applied to these m subvectors as if they were scalars. Alternatively, y is a tight interleaving of n independent matrix-vector products of stride-n portions of x with A.
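The two loop structures can be written out directly; the following is a minimal C++ sketch (helper names are ours, with A kept as a dense placeholder kernel) of y = (I_n ⊗ A)x and y = (A ⊗ I_n)x as just described.

    #include <vector>

    using Mat = std::vector<std::vector<double>>;   // dense m x m placeholder for A

    // y = (I_n (x) A) x: n independent products on contiguous size-m subvectors.
    void id_tensor_a(const Mat& A, int n, const double* x, double* y) {
        int m = (int)A.size();
        for (int j = 0; j < n; ++j)                 // block j
            for (int r = 0; r < m; ++r) {
                double s = 0.0;
                for (int c = 0; c < m; ++c) s += A[r][c] * x[j*m + c];
                y[j*m + r] = s;
            }
    }

    // y = (A (x) I_n) x: n interleaved products on stride-n subvectors.
    void a_tensor_id(const Mat& A, int n, const double* x, double* y) {
        int m = (int)A.size();
        for (int j = 0; j < n; ++j)                 // interleaving offset j
            for (int r = 0; r < m; ++r) {
                double s = 0.0;
                for (int c = 0; c < m; ++c) s += A[r][c] * x[c*n + j];
                y[r*n + j] = s;
            }
    }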

All of the SPL matrix operators also have iterative forms:

    ⊕_{i=0}^{n−1} A_i = A_0 ⊕ ... ⊕ A_{n−1} = diag(A_0, ..., A_{n−1}),

    ∏_{i=0}^{n−1} A_i = A_0 · ... · A_{n−1},

    ⊗_{i=0}^{n−1} A_i = A_0 ⊗ ... ⊗ A_{n−1}.

Table 2.2 provides a grammar for SPL in Backus-Naur form (BNF) [105]. It defines SPL as the disjoint union of choices (separated by |) for each symbol marked by ⟨·⟩. Such symbols are called non-terminal symbols in BNF terminology, and all other symbols are called terminals. As said before, we call the elements of SPL formulas.

SPL is a domain-specific language, because it can only be used to describe structured matrices. Further, SPL is declarative, since it only describes the structure or dataflow of the computation and not the actual implementation.

Divide-and-conquer algorithms as breakdown rules. Every fast transform algorithm is obtained through a succession of divide-and-conquer decompositions. We call these decompositions breakdown rules and express them in SPL following [99]. A breakdown rule typically decomposes a transform into several smaller transforms or converts it into a different transform of usually lesser complexity. The left-hand side of a breakdown rule is a transform, and the right-hand side is the corresponding decomposition, expressed as an SPL formula. Examples of breakdown rules are shown in Table 2.3.

Instead of equality, we use → to indicate that these rules are applied by replacing the left-hand side of the rule by the right-hand side.

Breakdown rules (2.1)–(2.4), for example, correspond to the well-known Cooley-Tukey [38], prime-factor [64], Rader [102], and Bluestein [27] DFT algorithms, respectively.

Some breakdown rules require auxiliary transforms, and to implement these auxiliary transforms, additional breakdown rules are necessary. For example, in Table 2.3, rDFT and rDHT are auxiliary transforms, needed to implement RDFT and DHT using (2.6).

A few hundred breakdown rules can be found in the literature. Examples include [20, 32, 38, 43, 47, 64, 89, 97, 102, 103, 121, 123, 125, 126].

For a complete description of a transform algorithm, besides breakdown rules, base case rules are also needed. For example:

    DFT_2 → F_2 = [ 1  1 ; 1  −1 ],    (2.19)
    DCT-2_2 → diag(1, 1/√2) F_2,    (2.20)
    DCT-4_2 → J_2 R_{13π/8},    (2.21)
    DCT-4_3 → (I_1 ⊕ J_2)(F_2 ⊕ I_1) (√(3/8) I_1 ⊕ √(1/2) [ 1/2  1 ; 1  −1 ]) (F_2 ⊕ I_1)(I_1 ⊕ J_2).    (2.22)

We will call a set of breakdown and base case rules sufficient for a transform T if it is possible to fully expand T for a set of sizes using these rules. For example, rules (2.9), (2.13), and (2.20) are sufficient for DCT-2 and DCT-4 of 2-power sizes. Rules (2.9) and (2.20) alone are not sufficient for DCT-2, since the rule for DCT-4 is missing.

Algorithm space. For a given transform, the recursive application of breakdown rules and the choice of rules at each level yield a very large space of alternative formulas. Each formula corresponds to a fast algorithm. Most of the formulas have approximately the same operations count, but different data flow.

From formulas to code. SPL formulas can be directly translated into code by applying code generation templates, as explained in [137]. We show a few examples of templates in Table 2.4.

In Section 2.4 we explain the limitations of the template-based approach and present an alternative code generation method.


Breakdown rules for main transforms

    DFT_n → (DFT_k ⊗ I_m) D_{k,m} (I_k ⊗ DFT_m) L^n_k    (2.1)
    DFT_n → V^{−1}_{m,k} (DFT_k ⊗ I_m)(I_k ⊗ DFT_m) V_{m,k}    (2.2)
    DFT_n → W^{−1}_n (I_1 ⊕ DFT_{n−1}) E_n (I_1 ⊕ DFT_{n−1}) W_n    (2.3)
    DFT_n → B^⊤_{n,m} D_m DFT_m D′_m DFT_m D″_m B_{n,m},  m ≥ 2n − 1    (2.4)
    DFT_n → P^⊤_{k/2,2m} (DFT_{2m} ⊕ (I_{k/2−1} ⊗_i C_{2m} rDFT_{2m}((i+1)/k))) (RDFT′_k ⊗ I_m)    (2.5)

    [RDFT_n | RDFT′_n | DHT_n | DHT′_n] →
        (P^⊤_{k/2,m} ⊗ I_2) ([RDFT_{2m} | RDFT′_{2m} | DHT_{2m} | DHT′_{2m}]
        ⊕ (I_{k/2−1} ⊗_i M_{2m} [rDFT_{2m}((i+1)/k) | rDFT_{2m}((i+1)/k) | rDHT_{2m}((i+1)/k) | rDHT_{2m}((i+1)/k)]))
        · ([RDFT′_k | RDFT′_k | DHT′_k | DHT′_k] ⊗ I_m)    (2.6)

    RDFT_n → D_n · DCT-2_n · P_n,  n odd    (2.7)
    DCT-2_n → P^⊤_{k/2,2m} (DCT-2_{2m} K^{2m}_2 ⊕ (I_{k/2−1} ⊗ N_{2m} RDFT-3^⊤_{2m})) G_n (L^{n/2}_{k/2} ⊗ I_2) · (I_m ⊗ RDFT′_k) Q_{m/2,k}    (2.8)
    DCT-2_n → L^n_{n/2} · (DCT-2_{n/2} ⊕ DCT-4_{n/2}) · [ I_{n/2}  J_{n/2} ; I_{n/2}  −J_{n/2} ]    (2.9)
    DCT-2_n → S_n · RDFT_n · K^n_2    (2.10)
    DCT-3_n → DCT-2^⊤_n    (2.11)
    DCT-4_n → Q^⊤_{k/2,2m} (I_{k/2} ⊗ N_{2m} RDFT-3^⊤_{2m}) G′_n (L^{n/2}_{k/2} ⊗ I_2)(I_m ⊗ RDFT-3_k) Q_{m/2,k}    (2.12)
    DCT-4_n → S′_n · DCT-2_n · D″′_n    (2.13)
    MDCT_n → DCT-4_n · [ −J_n  I_n ; I_n  −J_n ]    (2.14)
    IMDCT_n → [ I_n  −J_n ; −J_n  I_n ] · DCT-4_n    (2.15)
    WHT_n → (WHT_k ⊗ I_m)(I_k ⊗ WHT_m)    (2.16)

Breakdown rules for auxiliary transforms

    [rDFT_{2n}(u) | rDHT_{2n}(u)] → L^{2n}_m (I_k ⊗_i [rDFT_{2m}((i+u)/k) | rDHT_{2m}((i+u)/k)]) ([rDFT_{2k}(u) | rDHT_{2k}(u)] ⊗ I_m)    (2.17)
    RDFT-3_n → (Q^⊤_{k/2,m} ⊗ I_2) (I_k ⊗_i rDFT_{2m}((i+1/2)/k)) (RDFT-3_k ⊗ I_m)    (2.18)

Table 2.3: Examples of breakdown rules written in SPL. Above, Q, P, K, V, W are various permutation matrices, D are diagonal matrices, and B, C, E, G, M, N, S are other sparse matrices, whose precise form is irrelevant here. In (2.6) and (2.17) the bracketed alternatives [· | ... | ·] are selected consistently: taking the i-th entry of every group yields one rule.


    Formula M      Code for y = Mx

    F_2            y[0] = x[0] + x[1];
                   y[1] = x[0] - x[1];

    L^n_k          for (i=0; i<k; i++)
                     for (j=0; j<n/k; j++)
                       y[j+i*n/k] = x[j*k+i];

    A ⊕ B          <code for: y[0:1:k-1] = A * x[0:1:k-1]>
                   <code for: y[k:1:k+m-1] = B * x[k:1:k+m-1]>

    I_n ⊗ A        for (j=0; j<n; j++)
                     <code for: y[jm:1:jm+m-1] = A * x[jm:1:jm+m-1]>

    A ⊗ I_n        for (j=0; j<n; j++)
                     <code for: y[j:n:j+n(k-1)] = A * x[j:n:j+n(k-1)]>

Table 2.4: Translating SPL constructs to code: examples. Here y[b:s:e] denotes the strided subvector (y_b, y_{b+s}, ..., y_e).

2.3 Spiral

Spiral [99] is an automatic program generation and optimization system for linear signal transforms, which builds on the framework explained in Sections 2.1–2.2. Spiral uses SPL to represent fast transform algorithms and automates the process of obtaining fast fixed size transform implementations, starting just from the breakdown rules and base case rules.

The high level structure of the optimization and code generation process in Spiral is shown in Fig. 2.1. The input to Spiral is a formally specified transform of a fixed size (known at generation time), e.g., “DFT_1024”; the output is an optimized C program that computes the transform.

The generated program is automatically tuned to the platform on which Spiral is installed using a feedback-directed search in the space of alternative algorithms. We now explain this process in more detail.

Program generation in Spiral. After the user specifies a transform to be implemented, the Search/Learning module applies different breakdown rules to the transform to generate alternative SPL formulas. This process is guided by a heuristic feedback-driven search, with different possibilities for the search strategies.

The feedback is obtained either by predicting the performance in a certain way, e.g., [109, 110], or by translating the formula into target language (e.g., C) code and measuring the desired performance metric, which usually is the runtime of the code.

Before generating code, each formula undergoes optimization at several abstraction levels.

[Figure 2.1: Adaptive program generator Spiral. The input, a linear transform and size, flows through three levels: the algorithm level (Formula Generation and Formula Optimization, producing an algorithm as a formula in the SPL language), the implementation level (Implementation and Code Optimization, producing a C/Fortran implementation), and the evaluation level (Compilation and Performance Evaluation). A search module uses the measured performance to control the choices at the algorithm and implementation levels, yielding an optimized/adapted implementation.]

At the algorithm level, in the Formula Generation stage, Spiral applies breakdown rules such as (2.1) to the given transform to generate one out of many possible formulas, represented in SPL. In the Formula Optimization stage, Spiral applies high level algorithmic optimizations using rewriting systems [45]. This block was fully developed in this thesis to handle loop optimizations, vectorization, and parallelization for a large class of transforms (Table 1.1(c)). In the original Spiral system [99] (Table 1.1(a)) this block consisted only of a preliminary form of formula vectorization.

At the implementation level, in the Implementation stage, Spiral translates the optimized formula into an intermediate code representation. In the Code Optimization stage, further optimizations are performed on the intermediate code representation, including strength reduction, constant folding, copy propagation, and array scalar replacement [137]. The result is then unparsed to a C program.

At the evaluation level, the final program is compiled using a standard C compiler in the Compilation stage, and the actual runtime or other performance metric is measured in the Performance Evaluation stage.

To date, several search strategies have been implemented in Spiral. The most effective strategies are dynamic programming and evolutionary (genetic) search [108]. The different search strategies are described in detail in [99].
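To give a flavor of the feedback-driven search, here is a minimal C++ sketch of dynamic programming over Cooley-Tukey splits only; the stub cost functions stand in for actually generating, compiling, and timing code, and the real search ranges over all applicable breakdown rules, not just sizes.

    #include <map>

    // Stand-ins for measured runtimes; in Spiral these would be obtained by
    // generating code for the candidate and timing it on the target machine.
    double measure_base_case(int n) { return n <= 8 ? (double)n : 1e30; }
    double measure_combine(int k, int m) { return 0.1 * k * m; }

    std::map<int, double> best;   // memoized best cost per size

    // Best cost for a transform of size n: either a base case codelet or a
    // split n = k*m, reusing the memoized best plans for the smaller sizes.
    double search(int n) {
        auto it = best.find(n);
        if (it != best.end()) return it->second;
        double cost = measure_base_case(n);
        for (int k = 2; k * k <= n; ++k) {
            if (n % k) continue;
            double c = search(k) + search(n / k) + measure_combine(k, n / k);
            if (c < cost) cost = c;
        }
        return best[n] = cost;
    }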

Limitations of Spiral. While describing a large algorithm space for a wide variety of transforms, the original Spiral was limited in the types of code it could produce, as we explained in Table 1.1(a). The major limitations of the original Spiral are listed below.

• No general loop merging. Efficient loop code could not be produced for transforms other than the DFT; this problem is discussed in detail in the following section.

• Fixed size transforms only. Transform sizes must be known at generation time. The problem of producing general size code is the main focus of this thesis. We explain why this is difficult in Section 3.2.

• No support for general vectorization and parallelization. Only vectorized DFTs could be produced by the original Spiral, and multithreaded code could not be produced at all. Further, some of our early experiments with parallel code were limited by the requirement that the number of threads be known at generation time, similar to how the transform sizes must be fixed and known.

In this thesis, we address all of these limitations. We solve the general loop merging problem in Section 2.4. Chapters 3 and 5 explain how the fixed size limitation can be overcome. Chapter 4 presents the general vectorization and parallelization framework.

2.4 Σ-SPL and Loop Merging

Most fast transform algorithms (see Table 2.3) are “divide-and-conquer,” which means that a transform is decomposed into smaller transforms. In principle, this produces a structure that is well-suited for achieving good performance on memory hierarchies. However, the conquer step in these algorithms is iterative, i.e., it requires extra passes through the data. In particular, some of these steps are complicated permutations, which can deteriorate performance considerably. As a consequence, one of the keys to obtaining high performance is to merge these iterative steps to improve data locality and data reuse. For example, permutations can be converted into a reindexing in the subsequent computational loop. The Cooley-Tukey FFT breakdown rule (2.1), for instance, contains four factors, and thus a straightforward implementation would perform four passes over the data. However, in practice (2.1) is implemented in only two passes, as explained later.

Small example. To illustrate the problem, we show a small example of translating a simple SPL formula to code. Consider the formula

    M = (I_4 ⊗ F_2) L^8_4.    (2.23)

To obtain code for y = Mx we use the translation rules from Table 2.4 and get:

    // Input: double x[8], output: y[8]
    double t[8];
    for (int i=0; i<4; i++)
        for (int j=0; j<2; j++)
            t[j+i*2] = x[j*4+i];
    for (int j=0; j<4; j++) {
        y[2*j]   = t[2*j] + t[2*j+1];
        y[2*j+1] = t[2*j] - t[2*j+1];
    }

However, this is known to be suboptimal, since the stride permutation L^8_4 can be incorporated into the computation loop instead of being performed explicitly. This eliminates the redundant pass over the array and leads to the simplified code with only a single loop below:

    // Input: double x[8], output: y[8]
    for (int j=0; j<4; j++) {
        y[2*j]   = x[j] + x[j+4];
        y[2*j+1] = x[j] - x[j+4];
    }

In the original Spiral a special template was used to handle this specific formula and implement the optimization above.

Page 40: Library Generation For Linear Transforms

2.4.∑

-SPL AND LOOP MERGING 23

The goal of the loop merging optimizations is to automate the above transformation without the use of specialized templates.

2.4.1 Overview

The general problem of loop merging [42, 72] is NP-complete. Further, loop merging requires array dependence information, for which the most general methods, like [95], achieve exact results only if the array indices are affine expressions, and even for this class the analysis has exponential worst-case runtime. Since many transform algorithms use non-affine index mappings, standard array dependence tests do not work in general. We developed a simple domain-specific form of loop merging which does not face these problems.

In the domain of linear transforms, the problem of loop merging has been completely solved only for the DFT and only for one DFT method: the Cooley-Tukey FFT (see [122]). Examples include the first generation of Spiral [99] and the DFT libraries FFTW [48, 61] and UHFFT [84]. For the prime-factor FFT, which involves a different permutation, [122] describes a partial solution without the explicit permutation, which can only be used for a restrictive set of DFT sizes. This solution is implemented manually in UHFFT. Consequently, these libraries achieve high performance for DFT sizes that factor into small prime numbers, but are suboptimal for other sizes. More importantly, loop merging is not solved for other transforms. For example, the DCTs/DSTs in FFTW, as we already saw in Fig. 1.4, suffer a large performance penalty. For these transforms some limited conquer-step merging has been performed manually in FFTW, and as a result FFTW outperforms Intel IPP; however, overall the performance is still rather suboptimal.

Since our goal is automatic library generation, loop merging has to be performed automatically and across a wide range of transforms and breakdown rules, including the DCT/DST algorithms, the different DFT algorithms, and the other breakdown rules in Table 2.3. The solution to this problem is an extension of SPL, called Σ-SPL (Sigma-SPL), which makes loops and index mapping functions explicit. Using Σ-SPL, loop merging can be performed efficiently through rewriting [52]. In the following chapters Σ-SPL will also serve as the main tool for library generation.

2.4.2 Motivation for the General Method

There are two major problems which cannot be solved by using templates, and which the general loop merging framework is expected to solve. First, there exists a large variety of breakdown rules with different permutations (see Table 2.3), and hence different instantiations of the above problem. Second, the breakdown rules are applied recursively, or in other words, inserted into each other, producing several levels of permutations and other simple operations like scaling (i.e., diagonal matrices in SPL) that ideally would all be fused. Both of these lead to a combinatorial explosion of the number of required SPL templates.

We will use the DFT and its most important recursive algorithms as the motivating example throughout this section.

The three most important FFTs, shown in (2.1)–(2.3) and restated below, are respectively called Cooley-Tukey, prime-factor (or Good-Thomas), and Rader:

    DFT_n = (DFT_k ⊗ I_m) D_{k,m} (I_k ⊗ DFT_m) L^n_k,    (2.24)
    DFT_n = V^{−1}_{n,k} (DFT_k ⊗ I_m)(I_k ⊗ DFT_m) V_{n,k},    (2.25)
    DFT_n = W^{−1}_n (I_1 ⊕ DFT_{p−1}) E_n (I_1 ⊕ DFT_{p−1}) W_n.    (2.26)

Each of these FFTs reduces the problem of computing a DFT of size n to computing several DFTs of different and smaller sizes. The applicability of each FFT depends on the size n:

• The Cooley-Tukey FFT requires that n = km factors and reduces the DFT_n to m DFT_k's and k DFT_m's.

• The prime-factor FFT requires that n = km factors and that the factors are coprime: gcd(k,m) = 1. As in the Cooley-Tukey FFT, the DFT_n is then computed using m DFT_k's and k DFT_m's.

• The Rader FFT requires that n = p is prime and reduces the DFT_p to 2 DFT_{p−1}'s.

In (2.24)–(2.26), L, V, W are permutation matrices, D is a diagonal matrix, and E is “almost” diagonal with 2 additional off-diagonal entries.

The permutation matrices L, V, W in (2.24)–(2.26) correspond to the following permutations:

    ℓ^{mk}_k : i ↦ ⌊i/m⌋ + k (i mod m),    (2.27)
    v_{m,k} : i ↦ (m ⌊i/m⌋ + k (i mod m)) mod mk,  gcd(m,k) = 1,    (2.28)
    w^n_g : i ↦ 0 if i = 0, and g^{i−1} mod n otherwise,    (2.29)

where g is a suitably chosen, fixed integer¹. Note that (2.27) is equivalent to our previous definition jk + i ↦ im + j, but here we uniformly express all permutations as i ↦ p(i).
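For concreteness, the three permutations can be transcribed directly into code; the following C++ sketch (function names are ours) mirrors (2.27)–(2.29).

    #include <cstdint>

    // Stride permutation l^{mk}_k from (2.27).
    int64_t ell(int64_t i, int64_t m, int64_t k) {
        return i / m + k * (i % m);
    }

    // Prime-factor (Good-Thomas) permutation v_{m,k} from (2.28); gcd(m,k) = 1.
    int64_t v(int64_t i, int64_t m, int64_t k) {
        return (m * (i / m) + k * (i % m)) % (m * k);
    }

    // Rader permutation w^n_g from (2.29); n prime, g a generator of (Z/nZ)^x.
    int64_t w(int64_t i, int64_t n, int64_t g) {
        if (i == 0) return 0;
        int64_t r = 1 % n;
        for (int64_t e = 0; e < i - 1; ++e) r = (r * g) % n;   // g^(i-1) mod n
        return r;
    }

Note that v and especially w are much more expensive per index than the affine ℓ, which is why composing and simplifying them symbolically, rather than executing them as written, matters for performance.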

For high performance it is crucial to handle these permutations, and also their combinations (since the DFT is computed recursively), efficiently. The “affine” permutation (2.27) has been studied extensively in the compiler literature, whereas the more expensive mappings (2.28) and (2.29) have not received much attention, since they only occur in FFTs. Below we identify the transformations necessary to optimize the recursive combinations of these permutations. These transformations are not performed on the actual code, where they would be prohibitively expensive, but at a higher level of abstraction provided by Σ-SPL, introduced in this section.

Implementation in FFTW. FFTW implements only (2.24) and (2.26) as top-level recursive algorithms. In (2.24) the permutation L is never explicitly performed; instead, the stride value is passed as an argument in the recursion. The composition of strides yields a new stride, which is possible because of special properties of L. Similarly, the scaling by the diagonal D_{k,m} (the twiddle factors) is not performed as an extra step, but is also passed down the recursion and finally performed by special “twiddle codelets.” As a consequence, each Cooley-Tukey recursion requires only two passes through the data instead of four. Besides (2.24), FFTW also supports (2.26), for which the permutations are performed explicitly.

¹ Specifically, g is a generator of the multiplicative group (Z/nZ)×.


[Figure 2.2: Simple loop merging example, shown on the dataflow from x_0, ..., x_7 to y_0, ..., y_7. (a) Original: y = (I_4 ⊗ F_2) L^8_4 x — two passes, one for the permutation L^8_4 and one for the four F_2 kernels. (b) After loop merging: y = (Σ_{j=0}^{3} S(h_{2j,1}) F_2 G(h_{j,4})) x — one pass.]

In other words, the FFTW implementation faces the same problem as the original Spiral. Namely, the loop merging in the Cooley-Tukey FFT (2.24) is implemented manually as a special case, and efficiently handling alternative recursions (with other permutations) requires additional manual effort.

Implementation in Spiral. As we briefly explained above, for (2.24) the original Spiral (Table 1.1(a)) performed optimizations equivalent to FFTW. Namely, when translating an SPL formula based on (2.24) into code, the SPL compiler fused the twiddle factors and the stride permutation with the adjacent loops using special-purpose templates for SPL expressions of the form (I_k ⊗ A_m) L^{km}_k and (A_k ⊗ I_m) T^{km}_m. In other words, this optimization was also hard-coded specifically for Cooley-Tukey based algorithms.

Summary. In summary, both FFTW and the original Spiral handle (2.24) as a special case, an approach that is neither easily extensible, nor one that gives any insight into how to optimize the other FFT recursions or other transforms. To solve this problem we propose a lower level extension of SPL, called Σ-SPL.

Next, after explaining Σ-SPL and the loop merging optimizations, we show how to apply them to other breakdown rules, namely the FFTs (2.25) and (2.26). We have also used this framework to generate fast loop code for other transforms, which established the first column in Table 1.1(b).

2.4.3 Loop Merging Example

Before we formally introduce Σ-SPL we continue with the example (2.23) from the beginning of Section 2.4 and show the main ingredients of Σ-SPL and how the actual loop merging is performed.

SPL describes transform algorithms as sparse structured matrix factorizations built from small matrices, permutation matrices, diagonal matrices, tensor products, and other constructs (Section 2.2). SPL captures the data flow of an algorithm. However, all data reorganization steps are described explicitly as stages that perform passes through the data.

The formula

    (I_4 ⊗ F_2) L^8_4    (2.30)

from Section 2.4.1 describes two passes through the data. We show this graphically in Fig. 2.2(a). First, the data vector is shuffled according to L^8_4, and then the loop corresponding to I_4 ⊗ F_2


applies the computational kernel F_2 to consecutive subvectors. Our goal is to merge these two loops into a single loop that implements the data reorganization as a readdressing of the input of the computational kernels F_2. This is done using Σ-SPL and a small number of rewrite rules, leading to the result in Fig. 2.2(b).

First, we translate (2.30) into Σ-SPL (no loop merging is performed yet) to get

    ( Σ_{j=0}^{3} S(h_{2j,1}) F_2 G(h_{2j,1}) ) perm(ℓ^8_4).    (2.31)

In (2.31) the permutation matrix L^8_4 is expressed by its underlying permutation ℓ^8_4. The sum expresses the four iterations of the loop for I_4 ⊗ F_2. The gather and scatter matrices G(·) and S(·) express the loading and storing of the data in the loop, respectively. The actual loop merging is now performed by rewriting (2.31) twice:

    (2.31) → Σ_{j=0}^{3} ( S(h_{2j,1}) F_2 G(ℓ^8_4 ∘ h_{2j,1}) ) → Σ_{j=0}^{3} ( S(h_{2j,1}) F_2 G(h_{j,4}) ).    (2.32)

In (2.32), first the permutation was fused into the gather operation, leading to a new gather index mapping. Then the composed index mapping function ℓ^8_4 ∘ h_{2j,1} was simplified to h_{j,4}.
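The simplification in (2.32) can be spot-checked numerically; the following small C++ sketch (names are ours) verifies ℓ^8_4 ∘ h_{2j,1} = h_{j,4} on the domain I_2 for each of the four loop iterations.

    #include <cassert>

    int ell_8_4(int i) { return i / 2 + 4 * (i % 2); }   // permutation underlying L^8_4
    int h(int b, int s, int i) { return b + i * s; }     // stride function (2.34)

    int main() {
        for (int j = 0; j < 4; ++j)
            for (int i = 0; i < 2; ++i)       // domain of h_{2j,1} is I_2
                assert(ell_8_4(h(2 * j, 1, i)) == h(j, 4, i));
        return 0;
    }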

2.4.4 Σ-SPL: Definition

In the example in Fig. 2.2 the main computational loop was described by I_4 ⊗ F_2, and the data shuffling L^8_4 was combined with the adjacent computation. This motivates our definition of two categories of SPL constructs: decorations and skeletons.

The main computation loops in an SPL formula are defined by its skeleton. Examples of skeleton objects include direct sums and tensor products with identity matrices:

    A ⊕ B,  A ⊗ I_n,  and  I_n ⊗ A.

When an SPL formula is mapped into Σ-SPL, skeleton objects are translated into iterative sums and gather/scatter operations as in (2.31), to make the loop structure and the index mappings explicit. In (2.30) the skeleton is I_4 ⊗ F_2.

Permutation matrices and diagonal matrices are called decorations. Our optimizations merge decorations into the skeleton such that the extra stages disappear. In (2.30) the decoration is L^8_4 = perm(ℓ^8_4), and the gather index mapping resulting from the loop merging in (2.32) is ℓ^8_4 ∘ h_{2j,1}, which becomes h_{j,4} after simplification.

With the above example as a motivation, we now provide the formal definition of Σ-SPL, starting with an overview of its main components.

Main components. Σ-SPL consists of four components:

1. index mapping functions,

2. scalar functions,

3. parametrized matrices,

4. the iterative sum Σ.

These are defined next.

Index mapping functions. An index mapping function is a function mapping an interval into an interval. For example, we saw in (2.31) that gather, scatter, and permutation matrices are parameterized by such functions. We express index mapping functions in terms of primitive functions and function operators to capture their structure. This enables many simplifications which are exceedingly difficult on the corresponding C code.

An integer interval is denoted by I_n = {0, ..., n−1}, and an index mapping function f with domain I_n and range I_N is denoted by

    f : I_n → I_N ;  i ↦ f(i).

We use the shorthand notation f^{n→N} to refer to an index mapping function of the form f : I_n → I_N. If an index mapping function depends on a parameter, say j, we write f_j. A bijective index mapping function

    p : I_n → I_n ;  i ↦ p(i)

defines a permutation on n elements and is abbreviated p^n.

We introduce two primitive functions, the identity function and the stride function, respectively defined as

    ı^n : I_n → I_n ;  i ↦ i,    (2.33)
    h_{b,s} : I_n → I_N ;  i ↦ b + is,  for s | N.    (2.34)

Structured index mapping functions are built from the above primitives using function composition, and possibly other function operators. For two index mapping functions f^{m→M} and g^{n→N} with n = M, we define the function composition in the usual way:

    g ∘ f : I_m → I_N ;  i ↦ g(f(i)).

Scalar functions. A scalar function

    f : I_n → C ;  i ↦ f(i)

maps an integer interval to the domain of complex or real numbers and is abbreviated f^{n→C}. Scalar functions are used to describe diagonal matrices.

Parametrized matrices. Σ-SPL contains four types of parameterized matrices, described by their defining functions:

    G(r^{n→N}),  S(w^{n→N}),  perm(p^n),  and  diag(f^{n→C}).

Their interpretation as C code is summarized in Table 2.5(a).

Let e^n_k ∈ C^{n×1} be the column basis vector with the 1 in the k-th position and 0 elsewhere. The gather matrix for the index mapping f^{n→N} is

    G(f^{n→N}) := [ e^N_{f(0)} | e^N_{f(1)} | ... | e^N_{f(n−1)} ]^⊤.


(a) Parametrized matrices

    Formula M        Code for y = Mx

    G(f^{n→N})       for (j=0; j<n; j++)
                       y[j] = x[f(j)];

    S(f^{n→N})       for (j=0; j<n; j++)
                       y[f(j)] = x[j];

    perm(p^n)        for (j=0; j<n; j++)
                       y[j] = x[p(j)];

    diag(f^{n→C})    for (j=0; j<n; j++)
                       y[j] = f(j)*x[j];

(b) Iterative sum

    Σ_{j=0}^{k−1} A_j    for (j=0; j<k; j++)
                           <code for: y = A_j * x>

Table 2.5: Translating Σ-SPL constructs to code.

Permutation and scatter matrices are defined as follows, where (·)^⊤ denotes matrix transposition:

    perm(p^n) := G(p) = [ e^n_{p(0)} | e^n_{p(1)} | ... | e^n_{p(n−1)} ]^⊤,
    S(f^{n→N}) := G(f)^⊤ = [ e^N_{f(0)} | e^N_{f(1)} | ... | e^N_{f(n−1)} ].

This implies that

    y = G(f) · x  ⇔  y_i = x_{f(i)},
    y = perm(p) · x  ⇔  y_i = x_{p(i)},
    y = S(f) · x  ⇔  y_{f(i)} = x_i,

which explains the corresponding code in Table 2.5(a).

Gather matrices are wide and short, and scatter matrices are narrow and tall. For example, for h_{0,1} : I_n → I_N,

    G(h_{0,1}) = [ I_n | 0 ],  S(h_{0,1}) = G(h_{0,1})^⊤.

A scalar function f^{n→C} defines the n×n diagonal matrix

    diag(f^{n→C}) = diag(f(0), ..., f(n−1)).

Iterative sum. The iterative sum

    Σ_{j=0}^{n−1} A_j

is used to represent loops.

In Σ-SPL the summands A_j of an iterative sum are constrained such that the actual additions (except those incurred by the A_j themselves) are never performed, i.e., no two matrices A_{j1} and A_{j2} have a non-zero entry at a common position (i, k).


The interpretation of the iterative sum as a loop makes use of the distributivity of the sum,

    y = ( Σ_{j=0}^{n−1} A_j ) x = Σ_{j=0}^{n−1} (A_j x).

Due to the constraint that the iterative sum does not actually incur any additional operations, it encodes a loop in which each iteration produces a unique part of the final output vector.

Now, as an example, we show how ⊗ can be converted into a sum. We assume that A is n×n and omit the domain and range of the occurring stride functions for simplicity:

    I_k ⊗ A = diag(A, ..., A)
            = S_0 A G_0 + ... + S_{k−1} A G_{k−1}
            = S(h_{0,1}) A G(h_{0,1}) + ... + S(h_{(k−1)n,1}) A G(h_{(k−1)n,1})
            = Σ_{j=0}^{k−1} S(h_{jn,1}) A G(h_{jn,1}).

For example, in this equation,

    G_0 = G(h_{0,1}) = [ I_n | 0 ],  S_0 = S(h_{0,1}) = G(h_{0,1})^⊤.

Intuitively, the conversion to Σ-SPL makes the loop structure of y = (I_k ⊗ A)x explicit. In each iteration j, G(·) and S(·) specify how to read and write the respective portions of the input and output to be processed by A.

The code for y = ( Σ_{j=0}^{n−1} A_j ) x is shown in Table 2.5(b).

Σ-SPL example. As explained in the beginning of this section, we use iterative sums to make the loop structure of skeleton objects in SPL explicit. The summands are sparse matrices that depend on the summation index. For example, I_2 ⊗ F_2 becomes in Σ-SPL the sum

    [ F_2  0_2 ; 0_2  F_2 ] = [ F_2  0_2 ; 0_2  0_2 ] + [ 0_2  0_2 ; 0_2  F_2 ].    (2.35)

With the definitions of the scatter and gather matrices

    S_0 = [ I_2 | 0_2 ]^⊤,  S_1 = [ 0_2 | I_2 ]^⊤,
    G_0 = [ I_2 | 0_2 ],    G_1 = [ 0_2 | I_2 ],

(2.35) can be written as the iterative sum

    Σ_{j=0}^{1} S_j F_2 G_j.    (2.36)

However, in (2.36) the subscripts of S and G are integers and not functions. Using the stride index mapping function in (2.34) we express the gather and scatter matrices as

    S_j = S(h_{2j,1})  and  G_j = G(h_{2j,1}).

The matrix-vector product y = (I_2 ⊗ F_2)x then becomes in Σ-SPL

    y = ( Σ_{j=0}^{1} S(h_{2j,1}) F_2 G(h_{2j,1}) ) x.    (2.37)

Using the rules in Table 2.5 we obtain the unoptimized program in Implementation 1.

Implementation 1 (Unoptimized program for (2.37))

    // Input: _Complex double x[4], output: y[4]
    _Complex double t0[2], t1[2];
    for (int j=0; j<2; j++) {
        for (int i=0; i<2; i++) t0[i] = x[i+2*j];    // G(h_(2j,1))
        t1[0] = t0[0] + t0[1];                       // F_2
        t1[1] = t0[0] - t0[1];                       //
        for (int i=0; i<2; i++) y[i+2*j] = t1[i];    // S(h_(2j,1))
    }

After standard optimizations (performed by the standard SPL compiler), such as loop unrolling, array scalarization, and copy propagation, we obtain the optimized program in Implementation 2.

Implementation 2 (Optimized program for (2.37))

    // Input: _Complex double x[4], output: y[4]
    for (int j=0; j<2; j++) {
        y[2*j]   = x[2*j] + x[2*j+1];
        y[2*j+1] = x[2*j] - x[2*j+1];
    }

2.4.5 The Σ-SPL Rewriting System

In this section we describe the new loop optimization procedure and its implementation. The optimizations are implemented as rewrite rules, which operate in several passes on SPL and Σ-SPL expression trees. An overview of the different steps is shown in Figure 2.3. The steps are generic for all transforms and algorithms except the index simplification (marked with (*)), which requires the inclusion of rules specific to the class of algorithms considered. Namely, these are the rules that simplify the composition of index mapping functions built from primitives such as (2.33) and (2.34). We start with an overview; then we explain the steps in detail.

1. Expansion of skeleton. In the first step we translate an SPL formula (as generated within Spiral) into a corresponding Σ-SPL formula. The skeleton (see Section 2.4.4) is expanded into iterative sums, and the decorations are expressed in terms of their defining functions.

2. Loop merging. A generic set of rules merges the decorations into neighboring iterative sums, thus merging loops. In this process index mapping functions get symbolically composed, and thus become complicated or costly to compute.


[Figure 2.3: Loop merging and code generation for SPL formulas. The pipeline is SPL → expansion of skeleton → Σ-SPL → loop merging → Σ-SPL → index simplification (*) → Σ-SPL → code generation → code; (*) marks the transform/algorithm specific pass.]

3. Index simplification. In this stage the index mapping functions are simplified, which is possible due to their symbolic representation and a set of symbolic rules. The rules of this stage are transform dependent and encode the domain-specific knowledge of how to handle specific permutations. Many identities are based on simple number-theoretic properties of the permutations. Identifying these rules for a given transform is usually a research problem.

4. Code generation. After the structural optimization, the Σ-SPL compiler is used to translate the final expression into initial code (see Table 2.5), which is then further optimized as in the original SPL compiler [137].

In the following detailed explanation, we use the following simple formula as a running example:

    (I_m ⊗ DFT_k) L^{mk}_m,  L^{mk}_m = perm(ℓ^{mk}_m),    (2.38)

where ℓ is the stride permutation defined in (2.27).

The direct translation of (2.38) into code using Table 2.4 for m = 5 and k = 2 would lead to the code fragment in Implementation 3.

Implementation 3 (Compute y = (I_5 ⊗ F_2) L^10_5 without loop merging)

    // Input: _Complex double x[10], output: y[10]
    _Complex double t[10];
    // explicit stride permutation L^10_5
    for (int i=0; i<10; i++)
        t[i] = x[(i/2) + 5*(i % 2)];
    // kernel loop I_5 x DFT_2
    for (int i=0; i<5; i++) {
        y[2*i]   = t[2*i] + t[2*i+1];
        y[2*i+1] = t[2*i] - t[2*i+1];
    }


    A ⊕ B → S(h_{0,1}) A G(h_{0,1}) + S(h_{m,1}) B G(h_{n,1})    (2.39)

    A ⊗ I_k → Σ_{j=0}^{k−1} S(h_{j,k}) A G(h_{j,k})    (2.40)

    I_k ⊗ A → Σ_{j=0}^{k−1} S(h_{mj,1}) A G(h_{nj,1})    (2.41)

Table 2.6: Rules to expand the skeleton; A is m×n and B is m′×n′.

This code contains two loops, originating from the permutation matrix L^{mk}_m and from the computational kernel I_m ⊗ DFT_k. The goal in this example is to merge the two loops into one.

Step 1: Expansion of skeleton. This stage translates SPL formulas, generated by Spiral, into Σ-SPL formulas. It rewrites all skeleton objects in the SPL formula into iterative sums and makes the defining functions of decorations explicit. Table 2.6 summarizes the rewrite rules used in this step, assuming A ∈ C^{m×n}, B ∈ C^{m′×n′}.

Both tensor products and direct sums of matrices are translated into sums with stride index mapping functions parameterizing the gather and scatter matrices. In our example (2.38), the rewriting system applies rule (2.41) to obtain the Σ-SPL expression

    ( Σ_{j=0}^{m−1} S(h_{kj,1}) DFT_k G(h_{kj,1}) ) perm(ℓ^{mk}_m).    (2.42)

Step 2: Loop merging. The goal of loop merging is to propagate all index mapping functions of a nested sum into the parameter of one single gather and scatter matrix in the innermost sum, and to propagate all diagonal matrices into the innermost computational kernel. Table 2.7 summarizes the necessary rules. Loop merging consists of two steps: moving matrices into iterative sums, and actually merging or commuting matrices.

First, matrices are moved inside iterative sums by applying rules (2.44) and (2.45), which implement the distributive law. Note that iterative sums are not moved into other iterative sums.

Second, the system merges gather, scatter, and permutation matrices using rules (2.46)–(2.49) and pulls diagonal matrices into the computational kernel using (2.50) and (2.51). This step simply composes their defining functions, possibly creating complicated terms that must be simplified in the next step.

After loop merging, (2.42) is transformed into

    Σ_{j=0}^{m−1} ( S(h_{kj,1}) DFT_k G(ℓ^{mk}_m ∘ h_{kj,1}) )    (2.43)

through application of rules (2.44) and (2.46).

Step 3: Index mapping simplification. As said before, this is the only step that depends on the considered transform and its algorithms. We consider the DFT and the FFTs (2.24)–(2.26).


    ( Σ_{j=0}^{m−1} A_j ) M → Σ_{j=0}^{m−1} A_j M    (2.44)

    M ( Σ_{j=0}^{m−1} A_j ) → Σ_{j=0}^{m−1} M A_j    (2.45)

    G(s^{n→N₁}) G(r^{N₁→N}) → G(r ∘ s)    (2.46)

    S(v^{N₁→N}) S(w^{n→N₁}) → S(v ∘ w)    (2.47)

    G(r^{n→N}) perm(p^N) → G(p ∘ r)    (2.48)

    perm(p^N) S(w^{n→N}) → S(p^{−1} ∘ w)    (2.49)

    G(r^{n→N}) diag(f^{N→C}) → diag(f ∘ r) G(r)    (2.50)

    diag(f^{N→C}) S(w^{n→N}) → S(w) diag(f ∘ w)    (2.51)

Table 2.7: Loop merging rules.

These recursions feature permutations that involve integer power computations, modulo operations, and conditionals. In addition to the stride permutation (2.27) in the Cooley-Tukey FFT (2.24), the prime-factor FFT (2.25) uses V_{m,k} = perm(v_{m,k}) with gcd(m,k) = 1, defined in (2.28), and for prime n the Rader FFT (2.26) requires W_n = perm(w^n_g), with g a generator of the multiplicative group (Z/nZ)×, defined in (2.29).

If different breakdown rules are applied at different levels of the DFT recursion, the rewrite rules (2.46)–(2.49) can yield a composition of many index mapping functions. Since the individual index mappings are already rather complicated, the composed mappings become very complex. Thus, our rewriting system must be able to perform powerful index mapping simplifications.

In order to formulate all simplification rules for ℓ, v, and w in (2.27)–(2.29) we first have to introduce two helper functions:

    z_{b,s} : i ↦ (b + is) mod N,  N = sn,    (2.52)
    w^{n→N}_{ϕ,g} : i ↦ ϕ g^i mod N,  n | N−1,  N prime.    (2.53)

The power of using symbolic functions and operators becomes apparent next. Namely, we can identify a rather small set of context-insensitive simplification rules that suffices to simplify all index functions arising from combining the FFT rules (2.24)–(2.26). The required simplifications cannot be done solely using basic integer identities, as conflicting identities and special number-theoretic constraints (e.g., a variable being required to be the generator of a cyclic group of order n) would be involved. Our function symbols capture these conditions by construction, while our rules encode the constraints.

Table 2.8 summarizes the majority of the index simplification rules used for the FFTs. These rules were originally identified in [52], although there we used a different notation for some index mapping functions, which required some additional rules. Rules (2.54)–(2.56) are used to simplify Cooley-Tukey FFT decompositions, rules (2.57)–(2.60) for decompositions based on the prime-factor FFT, and rules (2.61)–(2.64) for the Rader FFT.


For the Cooley-Tukey FFT:

    ℓ^{mk}_m ∘ h^{k→mk}_{kj,1} → h^{k→mk}_{j,m}    (2.54)
    (ℓ^{mk}_m)^{−1} → ℓ^{mk}_k    (2.55)
    h^{N→N′}_{b′,s′} ∘ h^{n→N}_{b,s} → h^{n→N′}_{b′+s′b, s′s}    (2.56)

For the prime-factor FFT:

    v_{m,k} ∘ h^{k→mk}_{kj,1} → z^{k→mk}_{j,m}    (2.57)
    v_{m,k} ∘ h^{k→mk}_{j,m} → z^{k→mk}_{j,m}    (2.58)
    z^{N→N′}_{b′,s′} ∘ h^{n→N}_{b,s} → z^{n→N′}_{b′+s′b, s′s},  N = sn    (2.59)
    z^{N→N′}_{b′,s′} ∘ z^{n→N}_{b,s} → z^{n→N′}_{b′+s′b, s′s}    (2.60)

For the Rader FFT:

    w^N_g ∘ h^{1→N}_{0,1} → h^{1→N}_{0,1}    (2.61)
    w^N_g ∘ h^{N−1→N}_{1,1} → w^{N−1→N}_{1,g}    (2.62)
    w^{N′→N}_{ϕ,g} ∘ h^{n→N′}_{b,s} → w^{n→N}_{ϕg^b, g^s}    (2.63)
    w^{N′→N}_{ϕ,g} ∘ z^{n→N′}_{b,s} → w^{n→N}_{ϕg^b, g^s}    (2.64)

Table 2.8: Index function simplification rules.

To illustrate the complexity of the algebraic simplifications that some of these rules encode, we consider rule (2.59) as an example. The underlying algebraic transformation can be obtained by plugging the definitions of $h$ and $z$ from (2.34) and (2.52) into the rule. The left-hand side of the rule becomes
\[
(b' + s'(b + si)) \bmod N' = (b' + s'b + s'si) \bmod N'.
\]
The above is equivalent to $z_{b'+s'b,\,s's}$ only if the constraints in (2.52) hold. Namely, we must verify that $N' = s'sn$. We have $N' = s'N$ (constraint for $z_{b',\,s'}$) and $N = sn$ from (2.59); combining these two facts, we obtain $N' = s'sn$ as desired. The constraint in (2.52) is not arbitrary; it is required for (2.60).
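The same verification can be done numerically. The following small C++ sketch (ours, with arbitrary sample values for the parameters) checks rule (2.59) directly from the definitions of h and z.

#include <cassert>

// Sketch: numerically check rule (2.59),
//   z^{N->N'}_{b',s'} o h^{n->N}_{b,s}  ->  z^{n->N'}_{b'+s'b, s's},  N = s*n,
// using h_{b,s}(i) = b + s*i and z_{b,s}(i) = (b + s*i) mod N'.
int main() {
    const int n = 4, s = 3, N = s * n;   // constraint of (2.59): N = s*n
    const int sp = 5, bp = 2, b = 1;     // s', b', b: arbitrary sample values
    const int Np = sp * N;               // constraint of (2.52): N' = s'*N

    for (int i = 0; i < n; ++i) {
        int lhs = (bp + sp * (b + s * i)) % Np;     // z_{b',s'}(h_{b,s}(i))
        int rhs = (bp + sp * b + sp * s * i) % Np;  // z_{b'+s'b, s's}(i)
        assert(lhs == rhs);
    }
}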

Continuing with our running example, the index function simplification applies rule (2.54) to (2.43) to obtain
\[
\sum_{j=0}^{m-1} \bigl( S(h_{nj,\,1})\,\mathrm{DFT}_n\, G(h_{j,\,m}) \bigr). \tag{2.65}
\]

Step 4: Code generation. The Σ-SPL compiler first generates unoptimized code for a Σ-SPL formula using the context insensitive mappings given in Table 2.5. As in the original SPL compiler [137], an unrolling parameter to the Σ-SPL compiler controls which loops in a Σ-SPL formula will be fully unrolled and thus become basic blocks in which all decorations and index mappings are inlined. Our approach relies on further basic block level optimizations to produce efficient code. These are a superset of the optimizations implemented in the original SPL compiler, including 1) full loop unrolling of computational kernels, 2) array scalarization, 3) constant folding, 4) algebraic strength reduction, 5) copy propagation, 6) dead code elimination, 7) common subexpression elimination, 8) loop invariant code motion, and 9) induction-variable optimizations.

Translating our running example (2.65) into optimized code for m = 5, n = 2 using the Σ-SPL compiler leads to Implementation 4.

Implementation 4 (Compute y = (I_5 ⊗ F_2)L^{10}_5 following (2.65), with loop merging)

// Input: _Complex double x[10], output: y[10]
for (int j = 0; j < 5; j++) {
    y[2*j]   = x[j] + x[j+5];
    y[2*j+1] = x[j] - x[j+5];
}

Compared to Implementation 3 this code only consists of one loop, as desired.

2.4.6 The Σ-SPL Rewriting System: Rader and Prime-Factor Example

To demonstrate how our framework applies to more complicated SPL formulas, we show the steps for compiling a part of a 3-level recursion for a non-power-of-2 size DFT that uses all three different recursion rules (2.24)–(2.26), namely a DFT_{pq} with q prime, and q − 1 = rs (for example, p = 4, q = 7, r = 3, and s = 2 is a valid combination). Initially the prime-factor decomposition (2.25) is applied for the size pq, then the Rader decomposition (2.26) decomposes the prime size q into q − 1. Finally, a Cooley-Tukey step (2.24) is used to decompose q − 1 into rs. We only consider a fragment of the resulting SPL formula

\[
\bigl(I_p \otimes (I_1 \oplus (I_r \otimes \mathrm{DFT}_s) L^{rs}_r)\, W_q\bigr)\, V_{p,q}.
\]

This formula has three different permutations, and a naive implementation would require three explicit shuffle operations, leading to three extra passes through the data vector.

Our rewriting system merges these permutation matrices into the innermost loop and then simplifies the index mapping function, effectively reducing the number of necessary mod operations to approximately 3 mods per 2 data points and improving the locality of the computation at the same time. Below we show the intermediate steps.

Steps 1 and 2: Expansion of skeleton and loop merging. Rules (2.41) and (2.39) expand the skeleton. Then, the rules in Table 2.7 are used to merge the loops and thus move the decorations into the innermost sum. The resulting Σ-SPL formula is
\[
\sum_{j_1=0}^{p-1} \Bigl( S(h_{qj_1,\,1} \circ h_{0,\,1} \circ \imath^1)\, G(v_{p,q} \circ h_{qj_1,\,1} \circ w^q_g \circ h_{0,\,1}) + \sum_{j_0=0}^{r-1} S(h_{qj_1,\,1} \circ h_{1,\,1} \circ h_{sj_0,\,1})\, \mathrm{DFT}_s\, G(v_{p,q} \circ h_{qj_1,\,1} \circ w^q_g \circ h_{1,\,1} \circ \ell^{rs}_r \circ h_{sj_0,\,1}) \Bigr).
\]


Note that the occurring index functions are very complicated.

Step 3: Index mapping simplification. Using only the rules in Table 2.8, the gather and scatter index mapping functions are simplified to produce the optimized formula
\[
\sum_{\substack{j_1=0\\ b_1=qj_1}}^{p-1} \Bigl( S(h_{0,\,q} \circ (j_1)^p)\, G(z_{0,\,q} \circ (j_1)^p) + \sum_{\substack{j_0=0\\ \varphi_1=g^{j_0}}}^{r-1} S(h_{qj_1+sj_0+1,\,1})\, \mathrm{DFT}_s\, G(z_{b_1,\,p} \circ w^{s\to q}_{\varphi_1,\,g^s}) \Bigr). \tag{2.66}
\]

Step 4: Code generation. Now the code can be generated for some specific values of p, q, r, and s using the Σ-SPL compiler. Below we show optimized code for p = 4, q = 7, r = 3, s = 2.

Implementation 5 (Compute (I_4 ⊗ (I_1 ⊕ (I_3 ⊗ DFT_2)L^6_3)W_7)V_{4,7} following (2.66))

// Input: _Complex double x[28], output: y[28]
int p1, b1;
for (int j1 = 0; j1 <= 3; j1++) {
    y[7*j1] = x[(7*j1 % 28)];
    p1 = 1; b1 = 7*j1;
    for (int j0 = 0; j0 <= 2; j0++) {
        y[b1 + 2*j0 + 1] = x[(b1 + 4*p1) % 28] + x[(b1 + 24*p1) % 28];
        y[b1 + 2*j0 + 2] = x[(b1 + 4*p1) % 28] - x[(b1 + 24*p1) % 28];
        p1 = (p1*3 % 7);
    }
}

2.4.7 Final Remarks

We presented an extension of the SPL language and compiler used in Spiral to enable loop optimization for signal transform algorithms at the formula level. The approach concurs with the Spiral philosophy, which aims to perform optimizations at the "right level" of abstraction. The "right level" depends on the type of optimization and is usually a research question. For the loop optimizations considered here, we found that the right level is between SPL and the actual code, as reflected by Σ-SPL, which, unlike SPL, represents loops and index mappings explicitly and compactly. Our general approach still requires us to find the proper set of index function simplification rules, which we did for the FFTs (2.25)–(2.26).

The general loop merging solves one of the three fundamental limitations of Spiral, explained in Section 2.3.

Besides loop optimizations, the formal Σ-SPL framework developed here turns out to be the crucial tool for further research in formula level algorithm transformations. In particular, in the next section we will build upon Σ-SPL to enable general size library generation with Spiral.


Chapter 3

Library Generation: Library Structure

In this chapter we address the main challenge posed in this thesis: the automatic generation of general size library code as sketched in Table 1.2(d). The overall goal of the thesis is to completely fill the fourth row of Table 1.1(c), but here we will talk only about the first entry in the row, namely about scalar library generation. The generation of vectorized and parallel libraries is explained in Chapter 4.

To generate a library for a given transform (e.g., DFT_n) we need to be able to generate code for transforms of symbolic size n, i.e., the size becomes an additional parameter. This is fundamentally different from fixed-size code. For example, fixed-size transforms can be completely decomposed by Spiral (Section 2.3) at code generation time using suitable breakdown rules, but if the size is unknown, the breakdown rules can only be applied at runtime, since different rules have different applicability conditions based on the transform size. Thus, to generate a general size library we will need to "compile" the breakdown rules. As we explain in this chapter, our approach is based on compiling a set of breakdown rules into a so-called recursion step closure, which corresponds to a set of mutually recursive functions, and can be translated into code as explained in Chapter 5.

3.1 Overview

Our goal is to be able to generate different kinds of libraries that may be desirable for different applications. Examples include adaptive libraries (similar to FFTW) or fixed libraries (similar to Intel IPP). Depending on the library kind and the library infrastructure already in place, the generated code must look different. We will collectively call the desired library type, infrastructure, and the target language the library target or simply target.

The basic idea of our library generation approach decouples target dependent and target independent tasks and is shown in Fig. 3.1. The library generator consists of two high-level modules. The first module, Library Structure (described in this chapter and extended in Chapter 4 to include the parallelism framework), operates at the algorithm level and is independent of the library target. The second module, Library Implementation (described in Chapter 5), operates at the code (implementation) level and is dependent on the library target. The Library Structure module produces the so-called recursion step closure. Informally, the closure describes the set of functions (which possibly call each other) that compute the needed transforms. The closure is expressed as a set of


Figure 3.1: Library generation in Spiral.

Σ-SPL formulas. The Library Implementation module then uses the closure and the information about the library target to produce the final library implementation.

Our approach separates transform algorithm generation from the library implementation details. For example, if a new transform is requested, and the Library Structure module can generate the closure, the unmodified Library Implementation module can immediately generate several kinds of libraries, adaptive or fixed.

The library generation procedure is completely deterministic and does not involve any feedback-driven search at code generation time, which a lot of previous work on Spiral was centered around. Instead, it moves the platform adaptation to runtime, by generating libraries with feedback-driven adaptation mechanisms, similar to FFTW and UHFFT. However, this does not preclude search at code generation time, for example, for the generation of fixed-size library base cases (codelets).

We reuse the transforms, the domain-specific language SPL, and the transform breakdown rules provided with the original Spiral. The rest of the required library generation infrastructure, including Σ-SPL, rewrite rules, and the Σ-SPL compiler, is part of the contribution of this thesis.

In the rest of this chapter we will discuss the Library Structure module, based on SPL and Σ-SPL. In particular, the Σ-SPL framework introduced in the previous section is the crucial tool in the library generation process. First, in Section 3.2 we will introduce the problem of recursive code generation and explain why it is more difficult than generating single size code. Then, in Sections 3.3–3.5 we will walk through a simple example of generating the recursive code for the Cooley-Tukey FFT, explain how the process can be automated, and explain all the major steps of the general algorithm that pertain to the Library Structure module, leaving the code-level details to the Library Implementation described in Chapter 5.

Next, in Sections 3.6–3.9 we will explain the crucial extension to Σ-SPL called index-free Σ-SPL, which enables a number of important loop optimizations.


3.2 Motivation and Problem Statement

As a motivating example, we consider a simple recursive implementation of the Cooley-Tukey FFT breakdown rule, as applied to $\mathrm{DFT}_n$, with unknown composite $n$ and divisor $k \mid n$. We restate the breakdown rule from (2.1), replacing $m = n/k$ and $D_{k,m}$ by $\mathrm{diag}(d)$, where $d^{\,n\to\mathbb{C}}$ is the scalar function that provides the diagonal entries of $D_{k,m}$:
\[
\mathrm{DFT}_n = (\mathrm{DFT}_k \otimes I_{n/k})\, \mathrm{diag}(d)\, (I_k \otimes \mathrm{DFT}_{n/k})\, L^n_k. \tag{3.1}
\]

Pseudo code for the computation of y = DFT_n x following (3.1) could look as follows.

Implementation 6 (Compute y = DFT_n x following (3.1))

void dft(int n, complex *y, complex *x) {
    int k = choose_factor(n);
    complex *t1, *t2, *t3;
    // t1 = L^n_k * x
    t1 = Permute x with L(n,k);
    // t2 = (I_k tensor DFT_n/k) * t1
    for (int i = 0; i < k; ++i)
        dft(n/k, t2 + n/k*i, t1 + n/k*i);
    // t3 = diag(d(j)) * t2
    for (int i = 0; i < n; ++i)
        t3[i] = d(i) * t2[i];
    // y = (DFT_k tensor I_n/k) * t3
    // cannot call dft() recursively, need strided I/O
    for (int i = 0; i < n/k; ++i)
        dft_str(k, n/k, y + i, t3 + i);  // stride = n/k
}

void dft_str(int n, int stride, complex *y, complex *x) {
    // has to be implemented
}

The main problem here is that the straightforward implementation of the DFT function is not self-contained. First, it requires the precomputed diagonal elements d(i). Second, and more important, (DFT_k ⊗ I_{n/k}) requires the computation of the DFT on strided data; this means that we also need a new kind of DFT function, dft_str, parametrized by the stride. Thus, even in this straightforward implementation, an additional function that accommodates an extra parameter needs to be implemented.

In a real implementation, the situation can get much more complicated, since optimizations may introduce the need for a larger set of additional functions.

We will call such functions recursion steps. The original transform function (here: dft) is also a recursion step. Informally, a recursion step implements a transform with some context, which may include, but is not limited to, gather and scatter operations (as in the case of dft_str above), diagonal scaling, and loops. Note that the name "recursion step" does not necessarily imply recursion. Even an iterative algorithm will require recursion steps in the sense of our definition.

For example, FFTPACK [114–116] internally uses 6 different DFT recursion steps. Each of them corresponds to a different stage of a mixed-radix iterative algorithm. The fast iterative split-radix FFT implementation by Takuya Ooura [92] uses 28 recursion steps to implement both the complex DFT and the real-valued DFT (RDFT). FFTW 3.x [61] uses more than 20.

Here we consider the less sophisticated FFTW 2.x [60] for an easier comparison. In FFTW 2.x two different variants of the DFT function are used. The Cooley-Tukey FFT (3.1) is computed in two loops corresponding to the two tensor products, and the diagonal and the stride permutation in (3.1) are respectively fused with these loops. Conceptually this implies the following partition of the SPL breakdown rule:

\[
\mathrm{DFT}_n = \underbrace{(\mathrm{DFT}_k \otimes I_{n/k})\, \mathrm{diag}(d)}_{\text{loop}}\ \underbrace{(I_k \otimes \mathrm{DFT}_{n/k})\, L^n_k}_{\text{loop}}.
\]

Fusion of the stride permutation $L^n_k$ with the first loop requires a DFT function with different input and output strides, and the fusion of the diagonal $\mathrm{diag}(d)$ with the second loop requires a DFT function with an extra array parameter d, which holds the scaling factors. The corresponding pseudo code is given in Implementation 7. One can think of this implementation as a recursive equivalent of the Σ-SPL loop merging shown in Section 2.4.

Implementation 7 (FFTW 2.x-like implementation of (3.1))

void dft(int n, complex *Y, complex *X) {
    int k = choose_factor(n);
    // Y = (I_k tensor DFT_n/k) L(n,k) * X
    for (int i = 0; i < k; ++i)
        dft_str(n/k, k, 1, Y + (n/k)*i, X + (n/k)*i);
    // Y = (DFT_k tensor I_n/k) diag(d(j)) * Y
    for (int i = 0; i < n/k; ++i)
        dft_scaled(k, n/k, precomputed_d[i], Y + i, Y + i);
}

void dft_str(int n, int in_stride, int out_stride, complex *y, complex *x) {
    // has to be implemented
}

void dft_scaled(int n, int stride, complex *d, complex *y, complex *x) {
    // has to be implemented
}

In FFTW 2.x, the function dft_str is implemented recursively based on (3.1), similarly to Implementation 7. It calls itself and dft_scaled, and does not require any new functions. Eventually, the DFT size becomes sufficiently small, and the recursion terminates using a so-called codelet (a function that computes a small fixed-size DFT) for dft_str. dft_scaled is not recursive in FFTW 2.x, and is always implemented as a codelet, called a twiddle codelet in this case.

We say that the functions dft, dft_str, and dft_scaled form a so-called recursion step closure, which is a minimal set of functions sufficient to compute the transform. The recursion step closure


[Figure 3.2: Possible call tree for computing DFT1024 in FFTW 2.x using Implementation 7 (function names do not correspond to actual FFTW functions). The root DFT1024 (strided) has the children DFT16 (scaled, codelet) and DFT64 (strided); the latter in turn has the children DFT16 (scaled, codelet) and DFT16 (strided, codelet).]

is the central concept in our library generation framework. Later in this chapter we will show how to compute the closure automatically.

The codelets in FFTW are functions that implement dft_str and dft_scaled for a fixed, small value of n. The library provides codelets for several small values of n ≤ 64. We can visualize the FFTW recursion graphically using a call tree. For example, Fig. 3.2 shows a possible recursion for DFT1024. Since dft_scaled must be a codelet, only rightmost trees are supported. The array of precomputed diagonal elements precomputed_d in dft_scaled is created in an initialization stage using the rest of the library infrastructure.

Besides loop merging, FFTW 2.x also implements the following optimizations:

• Buffer reuse. The second step, dft_scaled, is performed in place, reusing the y array, which we call a "buffer". This eliminates the need for extra buffers, beyond the ones provided by the library user. This optimization minimizes memory usage and improves cache performance. In addition, it allows dft_scaled to take only a single stride as parameter (in contrast to the separate input and output strides needed for dft_str).

• Looped codelets. The codelet versions of dft_scaled and dft_str include the immediate outer loops. This improves performance, since it eliminates extra function call overhead, and enables the C compiler to reuse certain address computations and perform loop invariant code motion.

Using the example above, we can now state the general problem of generating recursive library code for a single breakdown rule.

Problem 1 (Library generation for a single breakdown rule) Given: A transform T and a breakdown rule that decomposes T into transforms of the same type (as in (3.1)). Tasks:

1. Find the recursion step closure, i.e., the minimal set of needed recursion steps (functions).

2. Implement each recursion step using again the given breakdown rule.

3. Generate the base cases (codelets) for the recursion steps for several small fixed sizes.

4. Combine the recursion steps and codelets into a library infrastructure.

Tasks 1–3 are performed by the "Library Structure" block in Fig. 3.1 and are the focus of the rest of this chapter. Task 4 depends on the desired library infrastructure; it is handled by the "Library Implementation" block and explained in Chapter 5.


We will proceed by showing a detailed informal example of manually performing the first two tasks in the procedure above. Then, we explain the general procedure, explain how to generate base cases, and how to incorporate buffer reuse and looped codelets.

3.3 Recursion Step Closure

The first step in the library generation process is the computation of the recursion step closure. The closure provides a rough sketch of the library and is the central object throughout the thesis.

3.3.1 Example

We will walk through the formal derivation of the required recursion steps for the Cooley-Tukey FFT (3.1) using Σ-SPL and explain their implementation. These are tasks 1 and 2 of Problem 1.

Informally, the procedure consists of several steps:

1. Apply the breakdown rule to the transform T to obtain an SPL formula.

2. Convert the SPL formula into Σ-SPL using Table 2.6.

3. Apply loop merging rewriting rules (Table 2.7).

4. Apply index simplification rules (Table 2.8).

5. Extract the required recursion steps from the Σ-SPL formula (we will show that this is possible using rewriting).

6. Repeat this procedure for each recursion step until closure is reached, i.e., no new recursion steps appear.

First, we define the recursion step tag. Given a formula F, {F} means that F is to be implemented as a recursion step, i.e., a separate function call. We will not tag arbitrary F, but only transforms and transforms with Σ-SPL decorations (i.e., gather, scatter, and diagonal matrices, as explained in Sec. 2.4.4). Intuitively, the tag is the Σ-SPL equivalent of a function call, and the boundaries of such function calls are shown by tagging larger fragments of formulas. This will become clearer below.

Now, we proceed with the example.

Example 1 (Find and implement recursion steps for the Cooley-Tukey FFT) Given: DFT_n and the Cooley-Tukey FFT (3.1). Tasks:

• Find the recursion step closure, i.e., the minimal set of needed recursion steps (functions).

• Implement each recursion step using again the given breakdown rule.

We start with DFT_n and follow the six steps explained above.

Step 1: Apply the breakdown rule. First, we apply the breakdown rule to the DFT. The smaller transforms are tagged as {DFT}. We obtain
\[
\{\mathrm{DFT}_n\} = (\{\mathrm{DFT}_k\} \otimes I_m)\, \mathrm{diag}(d)\, (I_k \otimes \{\mathrm{DFT}_m\})\, L^n_k. \tag{3.2}
\]


Step 2: Convert to Σ-SPL. Using SPL we cannot merge the permutation and the diagonal with the tensor products, as explained in Section 2.4. Thus, in the next step we convert (3.2) to Σ-SPL. This is done by applying the rules from Table 2.6 and yields
\[
\sum_{i=0}^{k-1} S(h_{i,\,k})\, \{\mathrm{DFT}_{n/k}\}\, G(h_{i,\,k}) \cdot \mathrm{diag}(d) \cdot \sum_{j=0}^{n/k-1} S(h_{jk,\,1})\, \{\mathrm{DFT}_k\}\, G(h_{jk,\,1}) \cdot \mathrm{perm}(\ell^n_{n/k}). \tag{3.3}
\]

Step 3: Apply loop merging rewrite rules. We apply the rules from Table 2.7 to merge in (3.3) the permutation and the diagonal with the adjacent loops. The result is
\[
\sum_{i=0}^{k-1} S(h_{i,\,k})\, \{\mathrm{DFT}_{n/k}\}\, \mathrm{diag}(d \circ h_{i,\,k})\, G(h_{i,\,k}) \cdot \sum_{j=0}^{n/k-1} S(h_{jk,\,1})\, \{\mathrm{DFT}_k\}\, G(\ell^n_{n/k} \circ h_{jk,\,1}). \tag{3.4}
\]

Step 4: Apply index simplification rules. We apply the rewriting rules from Table 2.8 to simplify the index mapping functions in (3.4), and obtain the formula
\[
\sum_{i=0}^{k-1} S(h_{i,\,k})\, \{\mathrm{DFT}_{n/k}\}\, \mathrm{diag}(d \circ h_{i,\,k})\, G(h_{i,\,k}) \cdot \sum_{j=0}^{n/k-1} S(h_{jk,\,1})\, \{\mathrm{DFT}_k\}\, G(h_{j,\,n/k}). \tag{3.5}
\]
The only affected construct is the rightmost gather G(·), in which the composed index mapping was simplified.

Step 5: Extract the required recursion steps. At this point, we have effectively performed loop merging for one step of the Cooley-Tukey FFT. To further merge loops recursively for the smaller DFT_{n/k} and DFT_k we push the decoration constructs inside these recursion steps, effectively creating more powerful recursion steps. Formally, this is done by expanding the scope of the recursion step tag, i.e., by simply moving all of the adjacent decoration constructs inside the recursion step tag (curly braces):
\[
\sum_{i=0}^{k-1} \{S(h_{i,\,k})\, \mathrm{DFT}_{n/k}\, \mathrm{diag}(d \circ h_{i,\,k})\, G(h_{i,\,k})\} \cdot \sum_{j=0}^{n/k-1} \{S(h_{jk,\,1})\, \mathrm{DFT}_k\, G(h_{j,\,n/k})\}. \tag{3.6}
\]

In (3.6) we can already see the dual loop structure, and the following two distinct recursion steps:
\begin{align}
&\{S(h_{i,\,k})\, \mathrm{DFT}_{n/k}\, \mathrm{diag}(d \circ h_{i,\,k})\, G(h_{i,\,k})\} \quad\text{and} \tag{3.7}\\
&\{S(h_{jk,\,1})\, \mathrm{DFT}_k\, G(h_{j,\,n/k})\}. \tag{3.8}
\end{align}
In both recursion steps the input and the output are accessed using the stride function h. In addition, the first recursion step (3.7) contains the diagonal matrix, which scales the input. This formally captures the fact that (3.7) can be implemented using the dft_scaled function, and (3.8) can be implemented with the dft_str function in Implementation 7.

In summary, Σ-SPL and the simple application of Σ-SPL rewriting rules reveal the two recursion steps necessary for the Cooley-Tukey FFT. The pseudo code for the implementation of (3.6) now looks essentially the same as in FFTW 2.x:


Implementation 8 (Compute y = DFT_n x following (3.6))

void dft(int n, complex *Y, complex *X) {
    int k = choose_factor(n);
    for (int i = 0; i < k; ++i)
        dft_str(k, n/k, 1, X + i, T + i*k);
    for (int i = 0; i < n/k; ++i)
        dft_scaled(n/k, k, precomputed_doh + i*k, Y + i, Y + i);
}

Note that the diagonal has to be precomputed according to the scalar function d ∘ h_{i,k} in (3.7). Different segments of the precomputed diagonal are passed to dft_scaled in different iterations.
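The precomputation itself is mechanical. The following C++ sketch (ours; the generating function d is passed as a function pointer, anticipating the abstraction used in Section 3.3.3) tabulates the k segments d ∘ h_{i,k}, each of length n/k.

#include <complex>
#include <vector>
typedef std::complex<double> cpx;

// Sketch: tabulate the diagonal segments d o h_{i,k} of (3.7).
// Segment i has the n/k entries d(h_{i,k}(j)) = d(i + j*k), where
// 'd' is whatever scalar function generates the diagonal.
std::vector<std::vector<cpx>> precompute_diag(cpx (*d)(int), int n, int k) {
    std::vector<std::vector<cpx>> seg(k, std::vector<cpx>(n / k));
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < n / k; ++j)
            seg[i][j] = d(i + j * k);   // h_{i,k}: j -> i + j*k
    return seg;
}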

Step 6: Implement recursion steps. The final step is to implement the recursion steps (3.7) and (3.8) by repeating Steps 1–5. The procedure is analogous, and the recursion step decoration constructs (e.g., S(h) and G(h) in (3.8)) will be automatically merged with the decorations inside the expansion of the DFT using the loop merging rules in Table 2.7. We will show this only for (3.8), assuming the radix m | k:
\begin{align*}
\{S(h_{jk,\,1})\, \mathrm{DFT}_k\, G(h_{j,\,n/k})\} &= \{S(h_{jk,\,1})\, (\{\mathrm{DFT}_{k/m}\} \otimes I_m)\, \mathrm{diag}(d)\, (I_{k/m} \otimes \{\mathrm{DFT}_m\})\, L^k_{k/m}\, G(h_{j,\,n/k})\}\\
&= \sum_{l=0}^{m-1} \{S(h_{jk+l,\,m})\, \mathrm{DFT}_{k/m}\, \mathrm{diag}(d \circ h_{l,\,m})\, G(h_{l,\,m})\} \cdot \sum_{r=0}^{k/m-1} \{S(h_{jk+rm,\,1})\, \mathrm{DFT}_m\, G(h_{j+(nr)/k,\,n/m})\}. \tag{3.9}
\end{align*}

Observe that the final result is similar to (3.6). The transform sizes, the strides, and other parameters differ, but the two recursion steps still correspond to dft_scaled and dft_str. In other words, no new recursion steps are needed.

If we assume that the recursion step (3.7) is always implemented as a codelet (as in FFTW 2.x), then there is no need to recursively implement it. Under this assumption, the recursion steps {DFT_n}, (3.7), and (3.8) form a recursion step closure, i.e., a sufficient set of recursion steps to recursively implement the DFT using the Cooley-Tukey FFT.

3.3.2 Overview of the General Algorithm

Next, we define the terminology we will use and then explain the steps in the previous example in a more complete and rigorous way.

A recursion step is a transform with additional "context", e.g., it is usually surrounded by scatter and gather constructs as in (3.7) and (3.8). A recursion step, after applying a procedure that we call parametrization, specifies the exact signature of the associated recursive function, such as dft_str in Implementation 7.

We refer to the recursion step expanded using a breakdown rule and rewritten as explained in Steps 1–5 of Section 3.3.1 as a Σ-SPL implementation of a recursion step. The process of obtaining a Σ-SPL implementation is called descending (into the recursion step). For example, (3.6) is a Σ-SPL implementation of DFT_n; indeed, it can be converted into the DFT code in Implementation 8.


[Figure 3.3: Library generation: "Library Structure". Input: transforms and breakdown rules. Output: the recursion step closure (if it exists) and Σ-SPL implementations of each recursion step. The flowchart iterates "Parametrize + Descend" within a "Compute recursion step closure" block as long as new recursion steps appear, and then generates base cases.]

A Σ-SPL implementation is a Σ-SPL formula which expresses the given recursion step in terms of other (smaller) recursion steps, or in terms of primitive Σ-SPL constructs. The former is a recursive Σ-SPL implementation obtained by descending, and the latter is non-recursive, implemented by a base case. Two examples of recursive Σ-SPL implementations are (3.9) and (3.6).

The main steps performed by the Library Structure block are shown in Figure 3.3. The input is a set of transforms (which are the simplest cases of recursion steps) and breakdown rules. The output is the recursion step closure and recursive and non-recursive (base case) Σ-SPL implementations of each recursion step.

In this section we discuss the "Compute recursion step closure" block, which takes as input breakdown rules and recursion steps (the initial transform is a recursion step) and produces the Σ-SPL closure and recursive general-size implementations. This is done by parametrizing and descending into the recursion steps, until no new recursion steps are found. The "Generate base cases" block, discussed in Section 3.4, adds additional, non-recursive fixed-size implementations for several recursion steps.

To make recursion step formulas self-contained functions, we need to replace all expressions with free variables (such as loop indices) by a set of independent parameters. These parameters will become the function arguments. We call this process parametrization of the recursion step; it is explained next.

3.3.3 Parametrization

We denote recursion step parameters with u_i. Besides the generated name, which contains no semantic information, each parameter has a type (e.g., integer, real number, etc.), which is automatically assigned during the parametrization. The parametrization involves three steps:

1. replace every expression with free variables by a new parameter of the same type;


2. determine the equality constraints on the parameters; and

3. find the smallest set of necessary parameters based on these constraints.

The procedure and further details are best explained using our running example, DFT_n.

The parametrization of DFT_n is trivial. We replace n = u_1. There are no constraints, so {DFT_{u_1}} is a parametrized recursion step. After descending we obtain (3.6) and the two recursion steps (3.7) and (3.8), which we have to parametrize.

For completeness, we include the domains and ranges of the index mapping functions, which we omitted before. We start with (3.8). First, we replace expressions with free variables by new parameters:
\[
\{S(h^{k\to n}_{jk,\,1})\, \mathrm{DFT}_k\, G(h^{k\to n}_{j,\,m})\} \xrightarrow{\ \mathrm{par}\ } \{S(h^{u_1\to u_2}_{u_3,\,1})\, \mathrm{DFT}_{u_4}\, G(h^{u_5\to u_6}_{u_7,\,u_8})\}.
\]

Note that constants, such as 1, survive parametrization. The result above is not yet a valid Σ-SPL formula, since the matrix dimensions do not necessarily match.

Next, we determine the parameter constraints that make the formula valid. In Σ-SPL one needs to check matrix products and function compositions. In the example, only the matrix product has to be valid, which means
\[
u_1 = u_4, \quad u_4 = u_5.
\]
The constraints partition the parameters into groups of interrelated parameters. In each group, one of the parameters is assumed to be known, and the resulting system of linear equations is solved for the rest of the parameters. The solution is then substituted into the Σ-SPL formula. In this example, we assume u_1 to be known. Solving the trivial linear system gives u_4 = u_5 = u_1, which yields the final result:

\[
\{S(h^{u_1\to u_2}_{u_3,\,1})\, \mathrm{DFT}_{u_1}\, G(h^{u_1\to u_6}_{u_7,\,u_8})\}. \tag{3.10}
\]

Next, we consider (3.7). The actual definition of the diagonal matrix in (3.7) in Spiral contains a special marker pre(·) (which we did not show) that tells the system that the elements of the diagonal are to be precomputed. Therefore it makes sense to abstract away the particular generating function of the diagonal. We do this by allowing parameters to also be functions, denoted as before. We will denote such parameters using the previous notation, as u^{A→B}. As before, A and B specify the domain and the range of the function u, with the additional caveat that both A and B can be integer parameters themselves, in which case they denote I_A and I_B. With this extension, the parametrization follows the same steps as in the previous example, and the final result is

With this extension the parametrization follows the same steps as in the previous example, andthe final result is

S(hi, k)DFTn/k diag(pre(d hi, k)

)G(hi, k)

par−−→S(hu1→u2

u3, u4)DFTu1

diag(pre(u7

u1→C))G(hu1→u9

u10, u11). (3.11)

Each parametrized recursion step can be used to mechanically construct a function declaration. The parameters become function arguments, and the function name can be assigned in a variety of ways, including building a name hash or linearizing a Σ-SPL formula into text. In the example below, we create function declarations for (3.10) and (3.11) using simple numbering to assign names:

void rstep1(int u1, int u2, int u3, int u6, int u7,
            int u8, complex *Y, complex *X);

void rstep2(int u1, int u2, int u3, int u4,
            complex (*u7)(int), int u9, int u10, int u11,
            complex *Y, complex *X);


\begin{align}
&\{S(h^{u_1\to u_2}_{u_3,\,1})\, \mathrm{DFT}_{u_1}\, G(h^{u_1\to u_6}_{u_7,\,u_8})\} \xrightarrow{\ \mathrm{Descend}\ } \notag\\
&\quad \sum_{i=0}^{u_1/f_1-1} \{S(h^{f_1\to u_2}_{u_3+i,\,u_1/f_1})\, \mathrm{DFT}_{f_1}\, \mathrm{diag}(\mathrm{pre}(\Omega^{u_1}_{u_1/f_1} \circ h^{f_1\to u_1}_{i,\,u_1/f_1}))\, G(h^{f_1\to u_1}_{i,\,u_1/f_1})\} \cdot \sum_{j=0}^{f_1-1} \{S(h^{u_1/f_1\to u_1}_{u_1 j/f_1,\,1})\, \mathrm{DFT}_{u_1/f_1}\, G(h^{u_1/f_1\to u_6}_{u_7+u_8 j,\,u_8 f_1})\} \tag{3.12}\\[1ex]
&\{S(h^{u_1\to u_2}_{u_3,\,1})\, \mathrm{DFT}_{u_1}\, G(h^{u_1\to u_6}_{u_7,\,u_8})\} \xrightarrow{\ \mathrm{RDescend}\ } \notag\\
&\quad \{S(h^{u_1\to u_2}_{u_3,\,1})\, \mathrm{DFT}_{u_1}\, G(h^{u_1\to u_6}_{u_7,\,u_8})\},\ \{S(h^{u_1\to u_2}_{u_3,\,u_4})\, \mathrm{DFT}_{u_1}\, \mathrm{diag}(\mathrm{pre}(u_7^{\,u_1\to\mathbb{C}}))\, G(h^{u_1\to u_9}_{u_{10},\,u_{11}})\} \tag{3.13}\\[1ex]
&\{S(h^{*\to *}_{*,\,1})\, \mathrm{DFT}_*\, G(h^{*\to *}_{*,\,*})\} \xrightarrow{\ \mathrm{RDescend}^*\ } \notag\\
&\quad \{S(h^{*\to *}_{*,\,1})\, \mathrm{DFT}_*\, G(h^{*\to *}_{*,\,*})\},\ \{S(h^{*\to *}_{*,\,*})\, \mathrm{DFT}_*\, \mathrm{diag}(\mathrm{pre}(*^{\,*\to\mathbb{C}}))\, G(h^{*\to *}_{*,\,*})\} \tag{3.14}
\end{align}

Table 3.1: Descending into the recursion step (3.10) using the Cooley-Tukey FFT (3.1). The left-hand side is the original recursion step, and the right-hand sides are as follows: "Descend" is the result of application of the breakdown rule and Σ-SPL rewriting, "RDescend" is the list of parametrized recursion steps needed for the implementation, and "RDescend*" is the same as "RDescend", but uses "*" to denote the parameter slots.

The order of the parameters is based on their order in the formula. These mechanically derived function declarations should be compared to the declarations of dft_str and dft_scaled in Implementation 7. The declarations above have more parameters, and also use a pointer to a generating function (u7) for the scaling diagonal entries, instead of an array pointer. The extra parameters are an artifact of an automated system, and it might be possible to prove their redundancy under certain circumstances. The function pointers, on the other hand, provide a convenient abstraction. The diagonal entry precomputation is handled by the Library Implementation module (Chapter 5), which also replaces function pointers by array pointers where precomputation is desired.

The parametrization accomplishes an important task: it creates reusable recursion steps.

3.3.4 Descend

After parametrization we descend into each recursion step as outlined earlier. Continuing with the example, we now descend into (3.10) and (3.11) to obtain their implementations. We only discuss (3.11) in detail.

We expand (3.11) using again (3.1). Note that in (3.1) there is one degree of freedom, the value of k, which will become an additional parameter that we denote with f_i. Inserting (3.1) into (3.11) yields
\[
S(h^{u_1\to u_2}_{u_3,\,1})\, (\{\mathrm{DFT}_{f_1}\} \otimes I_{u_1/f_1})\, \mathrm{diag}(\mathrm{pre}(d))\, (I_{f_1} \otimes \{\mathrm{DFT}_{u_1/f_1}\})\, L^{u_1}_{f_1}\, G(h^{u_1\to u_6}_{u_7,\,u_8}).
\]

There are four differences compared to (3.1): the DFT size is u_1; the single degree of freedom k in (3.1) is made explicit and fixed (k = f_1); the diagonal is marked with the precompute marker pre(·) and is given in terms of its generating function Ω (same as d before); and finally the potential new recursion steps (DFTs) are marked but are not yet finalized.

[Figure 3.4: Graphical representation of the recursion step closure obtained from the Cooley-Tukey FFT (3.1). (a) Restricted recursion, with three recursion steps: 1: {DFT_*}; 2: {S(h^{*→*}_{*,1}) DFT_* G(h^{*→*}_{*,*})}; 3: {S(h^{*→*}_{*,*}) DFT_* diag(pre(*^{*→C})) G(h^{*→*}_{*,*})}. (b) Full recursion, with the additional step 4: {S(h^{*→*}_{*,1}) DFT_* diag(pre(*^{*→C})) G(h^{*→*}_{*,*})}. The closure in (b) corresponds to (3.15).]

Next, Σ-SPL conversion and further rewriting are performed using the Σ-SPL loop merging and index simplification rules. The result is given in (3.12) in Table 3.1.

The result looks complicated due to the high level of detail, but it is a completely specified implementation of (3.11) using the Cooley-Tukey rule (3.1). For the recursion step closure we are only interested in the required children. This is accomplished by RDescend, which extracts the parametrized children from the result of Descend, as shown in (3.13). For readability it makes sense to drop the parameter names and replace them with "*". This is done by RDescend* in (3.14).

3.3.5 Computing the Closure

Iterating RDescend will produce the desired closure. Inspection shows that the recursion steps in (3.13) are equal to (3.10) and (3.11). Therefore, no new recursion steps are needed.

If we do not allow recursion for (3.10), the recursion step closure is complete with (3.10) and (3.11). The closure can be visually represented using a call graph. For example, the call graph for the closure with this restricted recursion is shown in Figure 3.4(a). The nodes are the recursion steps (here shown as "*"-ed versions). The edges show the result of RDescend*.

To obtain the full closure without recursion restrictions, we apply RDescend also to (3.10). (3.10) spawns two recursion steps: itself and a new specialized variant, which itself produces no new recursion steps. This yields the result below:
\[
\{\mathrm{DFT}_*\} \xrightarrow{\ \mathrm{Closure}^*\ }
\begin{cases}
\{\mathrm{DFT}_*\}\\
\{S(h^{*\to *}_{*,\,1})\, \mathrm{DFT}_*\, G(h^{*\to *}_{*,\,*})\}\\
\{S(h^{*\to *}_{*,\,*})\, \mathrm{DFT}_*\, \mathrm{diag}(\mathrm{pre}(*^{\,*\to\mathbb{C}}))\, G(h^{*\to *}_{*,\,*})\}\\
\{S(h^{*\to *}_{*,\,1})\, \mathrm{DFT}_*\, \mathrm{diag}(\mathrm{pre}(*^{\,*\to\mathbb{C}}))\, G(h^{*\to *}_{*,\,*})\}
\end{cases} \tag{3.15}
\]

The overall recursion step closure now has four recursion steps. The associated call graph is shown in Figure 3.4(b).

The first three recursion steps in Fig. 3.4(a) and Fig. 3.4(b) correspond to the dft, dft_str, and dft_scaled functions in Implementation 7. The fourth recursion step in Fig. 3.4(b) is a special case of the third, with one parameter (the scatter stride) being equal to 1. This information is useful, because it can lead to better performance due to reduced index computation.

We now state the general algorithm for a single breakdown rule below:

Algorithm 1 (Recursion step closure for a single breakdown rule) Given: A transform T and a breakdown rule B that decomposes T into transforms of the same type. Find: The recursion step closure R as a set of Σ-SPL formulas.

Closure(T, B)
1: R ← ∅
2: W ← {T_{u1}}
3: while W ≠ ∅ do
4:   R ← R ∪ W
5:   W ← (⋃_{w∈W} RDescend(w, B)) \ R
6: end while
7: return R

The procedure for computing the closure is as follows. Given a transform T, we initialize the closure R = {T_{u1}}. Then, we apply RDescend to every element in R, adding the new recursion steps to R. This is repeated until R does not grow any more. The worklist W is used to keep track of new (not yet descended into) recursion steps.

3.3.6 Handling Multiple Breakdown Rules

The extension to multiple breakdown rules is rather straightforward. At every iteration of Algorithm 1, instead of applying a single breakdown rule, we apply all applicable rules to all recursion steps in the worklist W. The algorithm is formally stated below.

Algorithm 2 (Recursion step closure for multiple transforms and multiple breakdown rules) Given: A set of breakdown rules B = {B^i} for transforms T = {T^i} (not necessarily all different). Find: The recursion step closure R to compute all T^i using all possible recursive combinations of applicable breakdown rules.

Given a set of parametrized recursion steps W, we denote with W(T) = {w ∈ W | T occurs in w}.

Closure(T, B)
1: R ← ∅
2: W ← ⋃_i {T^i_{u1}}
3: while W ≠ ∅ do
4:   R ← R ∪ W
5:   W ← (⋃_i ⋃_{w∈W(T^i)} RDescend(w, B^i)) \ R
6: end while
7: return R

The only differences between Algorithm 2 and Algorithm 1 are lines 2 and 5. The algorithm starts out with the closure being equal to the set of the given transforms. The closure is expanded by applying all applicable breakdown rules to the new recursion steps, until no new recursion steps appear.

3.3.7 Termination

The recursion step closure computation using Algorithm 1 or Algorithm 2 is not guaranteed to terminate; in other words, the closure could be infinite. Termination strongly depends on the rewriting rules in Table 2.7 and, most importantly, on the index-mapping function simplification rules in Table 2.8. For example, for the Cooley-Tukey FFT, the three applicable rules are (2.54), (2.55), and (2.56). These form the minimal set of needed rewrite rules: if any one is removed, the closure becomes infinite.

3.3.8 Unification: From Σ-SPL Implementations to Function Calls

If we want to compile a complicated Σ-SPL expression like (3.12) to code, we need to be able to generate function calls to other recursion steps. This topic is discussed in additional detail in Chapter 5; here we give the basic intuition.

To generate a function call, we need to determine its arguments. To convert the recursion step call {F} into a function call, we perform the so-called unification [9] of {F} with the parametrization of F. This provides the concrete values for the parameters u_i of the invoked recursion step. These values are then used to produce a function call in the code. For example,
\[
\mathrm{Unify}\bigl(\{S(h^{k\to n}_{jk,\,1})\, \mathrm{DFT}_k\, G(h^{k\to n}_{j,\,n/k})\},\ \{S(h^{u_1\to u_2}_{u_3,\,1})\, \mathrm{DFT}_{u_1}\, G(h^{u_1\to u_6}_{u_7,\,u_8})\}\bigr) = (u_1 \to k,\ u_2 \to n,\ u_3 \to jk,\ u_6 \to n,\ u_7 \to j,\ u_8 \to n/k).
\]
Using the result above, the pseudo-code for the function call then becomes

rstep1(k, n, j*k, n, j, n/k);
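Because the parameters u_i occupy fixed slots in the parametrized formula, unification here reduces to first-order matching: pairing each pattern slot with the concrete expression in the same position, with a consistency check if a parameter occurs twice. The following C++ sketch (ours; formulas are flattened into slot vectors purely for illustration) shows the idea.

#include <map>
#include <optional>
#include <string>
#include <vector>

// Sketch: match a parametrized pattern against a concrete recursion step
// call, slot by slot. Slots starting with "u" are parameters; anything
// else is a fixed symbol that must match literally.
using Subst = std::map<std::string, std::string>;

std::optional<Subst> unify(const std::vector<std::string>& pattern,
                           const std::vector<std::string>& concrete) {
    if (pattern.size() != concrete.size()) return std::nullopt;
    Subst s;
    for (size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i].rfind("u", 0) == 0) {          // parameter slot u_i
            auto it = s.find(pattern[i]);
            if (it == s.end()) s[pattern[i]] = concrete[i];
            else if (it->second != concrete[i]) return std::nullopt;
        } else if (pattern[i] != concrete[i]) {
            return std::nullopt;                       // fixed symbols differ
        }
    }
    return s;
}

// unify({"u1","u2","u3","u6","u7","u8"}, {"k","n","j*k","n","j","n/k"})
// yields {u1->k, u2->n, u3->j*k, u6->n, u7->j, u8->n/k}, from which the
// call rstep1(k, n, j*k, n, j, n/k) is emitted.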


3.4 Generating Base Cases

In Section 3.3 we explained how to compute the recursion step closure and obtain implementations of each recursion step. These implementations are parametrized and, in particular, are for general input size. However, at runtime, the recursion must be terminated with fixed size base cases. In this section we address the problem of automatically generating such base cases. In FFTW, the base cases are called codelets.

The base cases should be available for a range of small fixed sizes to reduce recursion overhead. A straightforward approach is to generate base cases for each recursion step.

To generate base case implementations we assume the list of desired fixed size transforms is known in advance, e.g., B = {DFT_2, DFT_3, DFT_4}, and perform the following steps.

1. Generate the set of fixed size recursion step formulas by taking the cross-product of sizes and recursion steps.

2. Find the subset of fixed-size transforms needed and use the standard Spiral system to search for recursion free implementations of these transforms.

3. Form the fixed size implementations of the recursion steps by plugging the transform implementations from step 2 into the fixed size recursion step formulas from step 1.

We will now walk through each step using the closure (3.15) as the running example.

Generate fixed size recursion step formulas. Each recursion step is matched against each fixed size transform in B to determine the constant parameters (including, but not limited to, the size). The parameters are then inserted into the recursion step formula to obtain a fixed size formula.

Consider, for instance, the matching of (3.10) against DFT_2, which yields u_1 = 2. Inserting 2 for u_1 in (3.10) leads to
\[
\{S(h^{2\to u_2}_{u_3,\,1})\, \mathrm{DFT}_2\, G(h^{2\to u_6}_{u_7,\,u_8})\}. \tag{3.16}
\]

After matching all elements of B against our closure we obtain the following set of formulas:
\[
\begin{Bmatrix}
\{\mathrm{DFT}_2\},\ \{\mathrm{DFT}_3\},\ \{\mathrm{DFT}_4\},\\
\{S(h^{2\to u_2}_{u_3,\,1})\, \mathrm{DFT}_2\, G(h^{2\to u_6}_{u_7,\,u_8})\},\\
\{S(h^{3\to u_2}_{u_3,\,1})\, \mathrm{DFT}_3\, G(h^{3\to u_6}_{u_7,\,u_8})\},\\
\{S(h^{4\to u_2}_{u_3,\,1})\, \mathrm{DFT}_4\, G(h^{4\to u_6}_{u_7,\,u_8})\},\\
\{S(h^{2\to u_2}_{u_3,\,u_4})\, \mathrm{DFT}_2\, \mathrm{diag}(\mathrm{pre}(u_7^{\,2\to\mathbb{C}}))\, G(h^{2\to u_9}_{u_{10},\,u_{11}})\},\\
\{S(h^{3\to u_2}_{u_3,\,u_4})\, \mathrm{DFT}_3\, \mathrm{diag}(\mathrm{pre}(u_7^{\,3\to\mathbb{C}}))\, G(h^{3\to u_9}_{u_{10},\,u_{11}})\},\\
\{S(h^{4\to u_2}_{u_3,\,u_4})\, \mathrm{DFT}_4\, \mathrm{diag}(\mathrm{pre}(u_7^{\,4\to\mathbb{C}}))\, G(h^{4\to u_9}_{u_{10},\,u_{11}})\},\\
\{S(h^{2\to u_2}_{u_3,\,1})\, \mathrm{DFT}_2\, \mathrm{diag}(\mathrm{pre}(u_6^{\,2\to\mathbb{C}}))\, G(h^{2\to u_8}_{u_9,\,u_{10}})\},\\
\{S(h^{3\to u_2}_{u_3,\,1})\, \mathrm{DFT}_3\, \mathrm{diag}(\mathrm{pre}(u_6^{\,3\to\mathbb{C}}))\, G(h^{3\to u_8}_{u_9,\,u_{10}})\},\\
\{S(h^{4\to u_2}_{u_3,\,1})\, \mathrm{DFT}_4\, \mathrm{diag}(\mathrm{pre}(u_6^{\,4\to\mathbb{C}}))\, G(h^{4\to u_8}_{u_9,\,u_{10}})\}
\end{Bmatrix}. \tag{3.17}
\]

Find and implement needed transforms. In our example, this step is trivial. Searching for transforms in (3.17) reveals that we need DFT_2, DFT_3, DFT_4. In the general case, however, this step can be more involved, due to possible additional auxiliary transforms. This happens, for example, for breakdown rules like (2.6). Due to the large number of transforms, the set B lists base cases for all possible transforms, and this step chooses the right subset.

For each transform, we use the standard Spiral system [99] to search for the best implementation and generate a fixed size Σ-SPL formula. For example, for DFT_2 there is only a single implementation:
\[
\mathrm{DFT}_2 = \begin{bmatrix} 1 & 1\\ 1 & -1 \end{bmatrix}.
\]
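Plugging this matrix into a fixed size recursion step formula such as (3.16) directly yields base case code. A sketch of the resulting codelet (ours; the function name and argument order are hypothetical) is:

#include <complex>
typedef std::complex<double> cpx;

// Sketch of a base case (codelet) for (3.16):
//   {S(h^{2->u2}_{u3,1}) DFT_2 G(h^{2->u6}_{u7,u8})}
// Gather at base u7 with stride u8, apply the DFT_2 butterfly,
// scatter at base u3 with stride 1.
void dft2_str(int u3, int u7, int u8, cpx *y, const cpx *x) {
    cpx t0 = x[u7];
    cpx t1 = x[u7 + u8];
    y[u3]     = t0 + t1;
    y[u3 + 1] = t0 - t1;
}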

Form the fixed size recursion step formulas. The results of the previous step are plugged into the formulas of (3.17). This yields:
\[
\begin{Bmatrix}
\begin{bmatrix} 1 & 1\\ 1 & -1 \end{bmatrix},\ \ldots,\\[1ex]
\Bigl\{S(h^{2\to u_2}_{u_3,\,1}) \begin{bmatrix} 1 & 1\\ 1 & -1 \end{bmatrix} G(h^{2\to u_6}_{u_7,\,u_8})\Bigr\},\ \ldots,\\[1ex]
\Bigl\{S(h^{2\to u_2}_{u_3,\,u_4}) \begin{bmatrix} 1 & 1\\ 1 & -1 \end{bmatrix} \mathrm{diag}(\mathrm{pre}(u_7^{\,2\to\mathbb{C}}))\, G(h^{2\to u_9}_{u_{10},\,u_{11}})\Bigr\},\ \ldots,\\[1ex]
\Bigl\{S(h^{2\to u_2}_{u_3,\,1}) \begin{bmatrix} 1 & 1\\ 1 & -1 \end{bmatrix} \mathrm{diag}(\mathrm{pre}(u_6^{\,2\to\mathbb{C}}))\, G(h^{2\to u_8}_{u_9,\,u_{10}})\Bigr\},\ \ldots
\end{Bmatrix} \tag{3.18}
\]

Note that the Σ-SPL implementations in (3.18) are recursion free and fixed size, but are still parametrized.

3.5 Representing Recursion: Descent Trees

We can conveniently visualize the recursive computation of transforms using so-called descent trees. Besides being an indispensable visualization aid in the following sections, descent trees are useful for performance modeling, as briefly explained below.

The nodes of a descent tree are the recursion steps, and the edges connect to the child recursion steps that result from the application of the descend operation.

An example of a descent tree for the computation of DFT_32 using (3.1) is shown in Fig. 3.5. Every time a descend is applied to a recursion step, several "children" recursion steps are needed, and thus we obtain a tree.

Since we consider the generation of general-size libraries, unlike the ruletrees in standard Spiral, these descent trees are unfolded at runtime (instead of at code generation time).

A descent tree is similar to a ruletree used in Spiral and described in [99], but has several important differences, highlighted in Table 3.2.

Perhaps the most important difference is that descent trees capture the context of the computation better than ruletrees. For example, the input and output strides are easily available in the descent tree, while they are difficult to obtain from the ruletree. The strides come from the Σ-SPL description of each recursion step, namely from the index mapping functions.

We use descent trees as a visual aid in explaining various breakdown rule combinations in the following sections.

Another useful application of descent trees is performance modeling and machine learning. For example, the work by Singer [109, 110] attempts to construct optimal ruletrees


[Figure 3.5: Descent tree for DFT_32. The root 1: {DFT_32} descends with f_1 = 4 into 3: {S(h^{4→*}_{*,*}) DFT_4 diag(pre(*^{4→C})) G(h^{4→*}_{*,*})} (base case) and 2: {S(h^{8→*}_{*,1}) DFT_8 G(h^{8→*}_{*,*})}; the latter descends with f_1 = 2 into 3: {S(h^{2→*}_{*,*}) DFT_2 diag(pre(*^{2→C})) G(h^{2→*}_{*,*})} (base case) and 2: {S(h^{4→*}_{*,1}) DFT_4 G(h^{4→*}_{*,*})} (base case).]

            Descent tree              Ruletree
Nodes       recursion steps           transforms
Edges       RDescend                  breakdown rule application
Leaf size   any (user setting)        fixed by breakdown rules
Context     yes                       no

Table 3.2: Comparison of descent trees and ruletrees.

using automated machine learning. One of the problems raised in these papers is the need for more context information. The authors obtain this information by computing it directly from the ruletree, but this analysis has to explicitly "understand" different breakdown rules to compute the context. In other words, support for the breakdown rules had to be hard-coded in their infrastructure. Using descent trees, on the other hand, this information is readily available.

3.6 Enabling Looped Recursion Steps: Index-Free Σ-SPL

All of our previous examples involve recursion steps with a single instance of a transform and additional "context" constructs, including gathers, scatters, and diagonals. In this section we discuss looped recursion steps, which include iterative sums, and thus an even larger context. We will first explain why this is desirable, and then explain why it is not achievable with standard Σ-SPL and parametrization.

To solve this problem, we develop an extension of Σ-SPL called index-free Σ-SPL. It enables the parametrization and further manipulation of looped recursion steps, and was inspired by the concept of I/O tensors in FFTW 3.x [61], which are used to represent "DFT problems"; these are equivalent to recursion steps in our terminology. In fact, one can use index-free Σ-SPL as a rigorous mathematical description of the "DFT problems" in FFTW. The transformation from Σ-SPL to index-free Σ-SPL is closely related to λ-lifting [71], used in compilers for programming languages that support higher-order functions.


3.6.1 Motivation

There are three motivations for supporting looped recursion steps. First, they immediately enable the generation of looped base cases. In practical terms this means that, for example, the application of the Cooley-Tukey FFT breakdown rule (3.1) to the DFT, instead of yielding a pair of smaller DFTs, yields a pair of loops over these smaller DFTs. In the former case, only the smaller DFTs can be implemented in a base case; in the latter case, the base case can also contain these outer loops over the small DFTs. This considerably improves the performance of the generated code, since the target language compiler (a C++ compiler in our case) can better optimize the function implementing the base case.

Second, they enable various loop optimizations by making it possible to create breakdown rules also for loops, rather than only for transforms. In Section 3.7 we describe a number of these optimizations. Further, we implement vectorization and parallelization (Chapter 4) in the same way.

Third, we believe that the combination of breakdown rules for loops and transforms provides a very elegant way to describe the implementation space.

However, as we show next, directly applying the parametrization to looped Σ-SPL expressions is not possible, and this leads us to index-free Σ-SPL, which solves the problem.

Small example. To demonstrate the problems arising with the parametrization of looped recursion steps, we show a small example. Consider the parametrization of {DFT_n ⊗ I_k}. First, we convert the expression to Σ-SPL to obtain:
\[
\sum_{j=0}^{k-1} S(h^{n\to kn}_{j,\,k})\, \mathrm{DFT}_n\, G(h^{n\to kn}_{j,\,k}). \tag{3.19}
\]

The index j of the iterative sum controls the base address of the index mapping functions. If we parametrize (3.19) by applying the parametrization algorithm from Section 3.3.3, first all of the scalar expressions are replaced by parameters, to obtain

\[
\sum_{j=0}^{u_{10}} S(h^{u_1\to u_2}_{u_3,\,u_4})\, \mathrm{DFT}_{u_5}\, G(h^{u_6\to u_7}_{u_8,\,u_9}). \tag{3.20}
\]

Note that the loop bound is replaced by u_{10}. Next, the constraints are applied, yielding u_1 = u_5 = u_6, and after back-substitution of u_5 and u_6, we get

\[
\sum_{j=0}^{u_{10}} S(h^{u_1\to u_2}_{u_3,\,u_4})\, \mathrm{DFT}_{u_1}\, G(h^{u_1\to u_7}_{u_8,\,u_9}). \tag{3.21}
\]

Unfortunately, (3.21) is not a valid parametrization, because all references to the index j have disappeared, and instead we obtained the parameters u_3 and u_8. One cannot have parameters in these slots, because the actual values are functions of the loop variable and are computed within the recursion step itself.


3.6.2 Index-Free Σ-SPL: Basic Idea

Instead of trying to modify the parametrization algorithm explained before, we designed a special form of Σ-SPL which does not use loop indices, and thus does not incur the problem described above.

The problem with loop indices is that they are, in some sense, a departure from a purely declarative language. The use of explicit loop indices establishes a link between the declaration of the index (i.e., in the iterative sum) and all uses of the index in various arithmetic expressions. When these expressions are replaced by parameters, the link is lost. So our solution is to eliminate the loop indices, which in addition makes Σ-SPL purely declarative.

We designed index-free Σ-SPL based on an important insight from FFTW 3 [61], namely the use of so-called vector strides to describe loops over transforms. For example, (3.19) becomes in index-free Σ-SPL
\[
\sum_{k} S(h^{n\to kn}_{0,\,k,1})\, \mathrm{DFT}_n\, G(h^{n\to kn}_{0,\,k,1}). \tag{3.22}
\]

In (3.22) the iterative sum does not provide a loop variable; instead, k denotes the number of iterations. The index mapping functions now carry a pair of strides: the standard stride k and the vector stride 1. They can be interpreted as 2-variable functions, where the extra variable is the loop index. Compare the two functions below:

\begin{align*}
\text{Original:} &\quad h^{n\to kn}_{j,\,k} : i \mapsto j + ki,\\
\text{Index-free:} &\quad h^{n\to kn}_{0,\,k,1} : (j, i) \mapsto 0 + 1\cdot j + k\cdot i = j + ki.
\end{align*}

We call the 2-variable version of h a ranked function. The rank of h is 1, which means that it has one extra variable corresponding to one loop index.

Our transformation is equivalent to λ-lifting [71], where the goal is to implement nested functions: 1-variable nested functions are transformed into 2-variable global functions.

However, in addition to pure λ-lifting, we hide the extra arguments by using predefined ranked functions (like h), which keeps the notation declarative.

3.6.3 Ranked Functions

In order to eliminate loop indices, we capture the inter-iteration behavior in an alternative way. First, we need to identify the objects that are affected by loop indices. In the examples, we see that the most important such objects are gathers, scatters, and diagonals parametrized by functions. Thus, the crucial step is to eliminate the loop index dependence of functions. By doing so we make a large class of Σ-SPL formulas index-free.

This is accomplished by using multivariate functions, as we have already shown above in (3.22). The loop index dependence is captured inside the function definition in a structured way. We call these special multivariate functions ranked functions. The rank of a function denotes the number of enclosing loops (∑ operators) that affect it. A regular function (without loops) f^{A→B} has rank 0.

A rank-1 function must be in the scope of at least one loop ∑. The loop index of the sum becomes an extra parameter. For example, below, g is a rank-1 function that lives inside a loop with n iterations:
\[
g : I_n \times A \to B;\ (j, i) \mapsto g(j, i).
\]


\begin{align}
h^{n\to N}_{b,\,s_0,s_1,\ldots,s_k} &: \mathbb{Z}^k \times I_n \to I_N;\ (j_k, \ldots, j_1, i) \mapsto b + is_0 + \sum_{l=1}^{k} s_l j_l, \tag{3.23}\\
z^{n\to N}_{b,\,s_0,s_1,\ldots,s_k} &: \mathbb{Z}^k \times I_n \to I_N;\ (j_k, \ldots, j_1, i) \mapsto \Bigl(b + is_0 + \sum_{l=1}^{k} s_l j_l\Bigr) \bmod N, \tag{3.24}\\
r^{n\to N}_{b,\,q,\,s_0,s_1,\ldots,s_k} &: \mathbb{Z}^k \times I_n \to I_N;\ (j_k, \ldots, j_1, i) \mapsto
\begin{cases}
b + is_0 + \sum_{l=1}^{k} s_l j_l, & i < \lfloor n/2 \rfloor,\\[0.5ex]
q - b - is_0 - \sum_{l=1}^{k} s_l j_l, & \text{else.}
\end{cases} \tag{3.25}
\end{align}

Table 3.3: Rank-k index mapping functions.

The first parameter of the function above is the loop index. If we have a rank-0 function f_j^{A→B} parametrized by the loop index j, we remove the explicit reference to j by using a rank-1 function g as above, with f_j(i) = g(j, i). If this transformation is applied to all functions, the explicit loop index is no longer necessary. For example, we obtain

\[
\sum_{j=0}^{n-1} S(f_j)\, A\, G(f_j) = \sum_{n} S(g)\, A\, G(g).
\]

Since we do not need the loop index on the right-hand side above, we just write the number of iterations under the sum. We also assume that ranked functions can appear anywhere in the Σ-SPL syntax tree, and that the loop indices (when they are introduced) are automatically bound to the right slots of the ranked function parameters. Therefore, for all purposes ranked functions can be treated (e.g., by gather and scatter matrices) as the single-argument functions that we had before.

In general, we do not need to keep track of the interval sizes (i.e., loop bounds) for the loop index variable slots in each ranked function; after all, we do have the loops available in the formula. So for convenience, we relax the definition slightly, so that a rank-k function is defined as

\[
g : \underbrace{\mathbb{Z} \times \cdots \times \mathbb{Z}}_{\mathbb{Z}^k} \times A \to B;\ (j_k, \ldots, j_1, i) \mapsto g(j_k, \ldots, j_1, i).
\]

We order the loop index function arguments in "outer-loop first" order. The outermost loop index j_k comes first, and the innermost j_1 comes last, before i, which is the regular function argument of a rank-0 function. In other words, the subscript of j denotes how far away the associated loop is, with 1 being the closest (innermost).

In Table 3.3 we generalize three important index mapping functions to ranked functions. h and z are both defined in Section 2.4 and are used in the Cooley-Tukey FFT (2.24) and the prime-factor FFT (2.25), respectively; r is used in RDFT and DCT algorithms.
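Operationally, a ranked function is an ordinary function with extra integer slots for the loop indices. The following C++ sketch (ours) evaluates the rank-k stride function h of (3.23); the chosen array ordering is a convention internal to the sketch.

#include <vector>

// Sketch of the rank-k function h^{n->N}_{b, s0, s1, ..., sk} of (3.23):
// returns b + i*s0 + sum_l s_l * j_l. 'svec' holds the vector strides
// s_1..s_k and 'j' the matching loop indices j_1..j_k in the same order.
long h_ranked(long b, long s0, const std::vector<long>& svec,
              const std::vector<long>& j, long i) {
    long r = b + i * s0;
    for (size_t l = 0; l < svec.size(); ++l) r += svec[l] * j[l];
    return r;
}

// For rank 1 this reproduces the index-free h^{n->kn}_{0,k,1} of (3.22):
// h_ranked(0, k, {1}, {j}, i) = j + k*i, matching the original h_{j,k}.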

3.6.4 Ranked Function Operators

Now we define several operators necessary to manipulate ranked functions. We start by defining an appropriate version of composition, and proceed with other operations specific to ranked functions.

Page 74: Library Generation For Linear Transforms

3.6. ENABLING LOOPED RECURSION STEPS: INDEX-FREE∑

-SPL 57

Composition. We define the composition of ranked functions in a special way to be compatible with regular composition. For example, for a pair of (non-ranked) functions f_j and g_j that depend on the loop index, (f_j ∘ g_j)(i) = f_j(g_j(i)), and the index j does not affect the composition. Our goal is to define ranked function composition in a similar way.

To accomplish this, we define the composition for ranked functions, such that the loop indexarguments are not passed along through the composition chain. For example, for a pair of rank-1functions, composition is defined by:

fZ×B→C gZ×A→B : Z×A→ C; (j1, i) 7→ f(j1, g(j1, i)).

Composable functions do not have to be of the same rank. The rank of the result is equal to the maximal rank of the composed functions. Formally, given a rank-k function $f$ and a rank-n function $g$, their composition is defined as

$$f^{\mathbb{Z}^k\times B\to C} \circ g^{\mathbb{Z}^n\times A\to B} : \mathbb{Z}^r\times A \to C;\quad (j_r,\ldots,j_1,i) \mapsto f(j_k,\ldots,j_1,\, g(j_n,\ldots,j_1,i)),\quad r = \max(k, n).$$

Downrank. Ranked functions are tied to the loops (iterative sums) in their immediate formula context. If a function needs to be taken out of its context, the implicit loop dependencies must be made explicit by introducing an explicit loop variable, and hence decreasing the rank of the function. We call this operation downranking. We say that a function is fully downranked if its rank is reduced to 0. Fully downranking all functions is equivalent to converting a formula from index-free $\sum$-SPL to regular $\sum$-SPL. Downranking is equivalent to function currying [120]. Formally, for a rank-1 function $f$, the downrank operation is defined as follows:

$$\operatorname{down}(f^{\mathbb{Z}\times A\to B}, j, 1) : A \to B;\quad i \mapsto f(j, i), \qquad \operatorname{rank}(f) = 1.$$

The parameter 1 specifies that we are downranking with respect to the first loop (the only choice). Thus the following identity holds:

$$\sum_{n} S(f)\, A\, G(f) = \sum_{j=0}^{n-1} S(\operatorname{down}(f, j, 1))\, A\, G(\operatorname{down}(f, j, 1)).$$

For a general rank-k function we can downrank with respect to any of the k loops, specified by $l$:

$$\operatorname{down}(f^{\mathbb{Z}^k\times A\to B}, j, l) : \mathbb{Z}^{k-1}\times A \to B;\quad (j_k,\ldots,j_{l+1},j_{l-1},\ldots,j_1,i) \mapsto f(j_k,\ldots,j_{l+1},\,j,\,j_{l-1},\ldots,j_1,i), \qquad \operatorname{rank}(f) = k.$$

Uprank. Another operator we will need is called uprank. It is not the inverse of downrank. Before we define uprank, we motivate the need for it. Consider the following restatement of rewrite rule (2.44), which becomes invalid in the presence of ranked functions:

$$\Big(\sum A\Big)\, M \to \Big(\sum_{n} A M\Big).$$


The rule is invalid when applied as follows, as we explain next:

$$\Big(\sum S(f)\Big(\sum A\Big) G(f)\Big) \to \Big(\sum \sum S(f)\, A\, G(f)\Big), \qquad \operatorname{rank}(f) = 1. \qquad (3.26)$$

The left-hand and right-hand sides of (3.26) are not equal. This can be shown by downranking $f$ on both sides (and thus introducing explicit loop indices):

$$\Big(\sum S(f)\Big(\sum A\Big) G(f)\Big) \xrightarrow{\text{downrank}} \sum_{j} S(\operatorname{down}(f, j, 1))\Big(\sum_{k} A\Big) G(\operatorname{down}(f, j, 1)) \qquad (3.27)$$

$$\Big(\sum\sum S(f)\, A\, G(f)\Big) \xrightarrow{\text{downrank}} \sum_{j}\sum_{k} S(\operatorname{down}(f, k, 1))\, A\, G(\operatorname{down}(f, k, 1)) \qquad (3.28)$$

Above, (3.27) is correct, and (3.28) is invalid, because $f$ is downranked with $k$ (instead of $j$) as a loop variable. Originally, $f$ is a rank-1 function tied to its immediate parent loop, which happens to be the outer loop over $j$. If the rewrite rule is applied, the context changes, and the immediate parent loop of $f$ becomes the inner loop over $k$. This leads to the invalid result in (3.28).

To fix this problem, we need to increase the rank of $f$ to 2 when applying rewrite rule (2.44). This is accomplished by the uprank operator, defined below for a rank-1 $f$:

$$\operatorname{up}(f^{\mathbb{Z}\times A\to B}) : \mathbb{Z}^2\times A \to B;\quad (j_2, j_1, i) \mapsto f(j_2, i), \qquad \operatorname{rank}(f) = 1.$$

Since there is no dependence on the new inner loop, the corresponding loop index $j_1$ is simply ignored. The correct way to transform (3.26) uses upranking, as below:

$$\Big(\sum S(f)\Big(\sum A\Big) G(f)\Big) \to \Big(\sum\sum S(\operatorname{up}(f))\, A\, G(\operatorname{up}(f))\Big). \qquad (3.29)$$

The above is correct for an $f$ of arbitrary rank. As a special case, when $\operatorname{rank}(f) = 0$, there is no need to uprank, as reflected in the original rule (2.44).

In the general case of a rank-k function $f$, upranking is defined as

$$\operatorname{up}(f^{\mathbb{Z}^k\times A\to B}) : \mathbb{Z}^{k+1}\times A \to B;\quad (j_{k+1},\ldots,j_1,i) \mapsto f(j_{k+1},\ldots,j_2,i), \qquad \operatorname{rank}(f) = k.$$

3.6.5 General λ-Lifting

Using only ranked functions we can convert a very large class of $\sum$-SPL formulas to an index-free representation. Formulas that contain references to loop indices outside of functions must be handled using the general λ-lifting scheme, explained in this section.

Consider the following SPL formula fragment from the RDFT breakdown rule (2.6):

$$\bigoplus_{j=0}^{k/2-2} \operatorname{rDFT}_{2m}((j+1)/k). \qquad (3.30)$$

Above, $(j+1)/k$ is a parameter of the auxiliary transform rDFT. It is not possible to convert it to an index-free $\sum$-SPL expression using the rewrite rules from Table 3.4, because the reference to the loop index inside the rDFT parameter will stay.


The first step is to represent the scalar expression containing $j$ by a dummy single-argument function, to make it consistent with other functions like $h$. We obtain

$$\bigoplus_{j=0}^{k/2-2} \operatorname{rDFT}_{2m}(\lambda_j(0)), \qquad \lambda_j : I_1 \to \mathbb{R};\ i \mapsto (j+1)/k.$$

Above, $\lambda_j$ represents the original expression. Since it is a function, we have to apply it to an argument, and the only possible argument is $0 \in I_1$.

Next, we introduce a dummy function evaluation operator λ-wrap. The intuitive meaning of λ-wrap is "treat this function as a scalar":

$$\lambda\text{-wrap}(f^{1\to A}) = f(0).$$

To convert a function to a scalar, we evaluate it at 0. The result of λ-wrap is a scalar, and it enables the use of single-argument functions where scalars are expected. Using λ-wrap, we rewrite (3.30) as

$$\bigoplus_{j=0}^{k/2-2} \operatorname{rDFT}_{2m}(\lambda\text{-wrap}(\lambda_j)).$$

Finally, we convert $\lambda_j$ to a rank-1 function $\lambda$, and eliminate $j$ from the iterative sum, leaving only the number of iterations:

$$\bigoplus^{k/2-1} \operatorname{rDFT}_{2m}(\lambda\text{-wrap}(\lambda)), \qquad \lambda : (j, i) \mapsto (j+1)/k.$$

Even though the above is not $\sum$-SPL (the direct sum operator is not converted into $\sum$ with gather and scatter), we show its parametrization to clarify what happens to $\lambda$, without cluttering with unnecessary details:

$$\bigoplus^{k/2-1} \operatorname{rDFT}_{2m}(\lambda\text{-wrap}(\lambda)) \xrightarrow{\text{par}} \bigoplus^{u_1} \operatorname{rDFT}_{u_2}(\lambda\text{-wrap}(u_3^{\mathbb{Z}\times 1\to\mathbb{R}})).$$

Basically, the above is parametrized by substituting a parameter function $u_3$ for $\lambda$.
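As a rough intuition for how a λ-wrapped parameter could surface in a generated library, consider this hypothetical C++ sketch (our own illustration; lambda_wrap and rdft_step are invented names): the recursion step receives a function object instead of a scalar and evaluates it at 0 wherever a scalar is expected.

#include <functional>

// lambda_wrap "treats the function as a scalar" by evaluating it at i = 0.
double lambda_wrap(const std::function<double(int)>& f) { return f(0); }

void rdft_step(int j, int k /*, buffers ... */) {
    // Downranked lambda_j of (3.30): i -> (j + 1)/k.
    auto lam = [=](int i) { return (j + 1.0) / k; };
    double scalar_param = lambda_wrap(lam);  // = (j + 1)/k
    // ... use scalar_param as the rDFT_{2m} parameter ...
    (void)scalar_param;
}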

3.6.6 Library Generation with Index-Free $\sum$-SPL

The library generation can use index-free $\sum$-SPL instead of regular $\sum$-SPL without any major changes. However, the index-free representation makes it possible to expand the scope of the recursion step tag to include loops, and this will no longer cause problems with parametrization.

Generating code for index-free formulas can be done using the standard $\sum$-SPL code generation rules from Table 2.5, if the index-free formula is first converted to a regular one by downranking all index-free functions.

All other formula manipulations, including parametrization, descending, and other library-related manipulations, work with index-free formulas.


$$(I_k \otimes A_n) \to \sum_{k} S(h_{0,\,1,n})\, A_n\, G(h_{0,\,1,n}), \qquad (3.31)$$

$$(A_n \otimes I_k) \to \sum_{k} S(h_{0,\,k,1})\, A_n\, G(h_{0,\,k,1}), \qquad (3.32)$$

$$(I_m \otimes A_n \otimes I_k) \to \sum_{m}\sum_{k} S(h_{0,\,k,1,nk})\, A_n\, G(h_{0,\,k,1,nk}), \ \text{or} \qquad (3.33)$$

$$\phantom{(I_m \otimes A_n \otimes I_k)} \to \sum_{k}\sum_{m} S(h_{0,\,k,nk,1})\, A_n\, G(h_{0,\,k,nk,1}) \qquad (3.34)$$

Table 3.4: Rewrite rules for converting SPL to ranked $\sum$-SPL.

$$\Big(\sum A\Big)\, B(f) \to \Big(\sum A\, B(\operatorname{up}(f))\Big), \qquad B \in \{G, \operatorname{diag}\}, \qquad (3.35)$$

$$B(f)\Big(\sum A\Big) \to \Big(\sum B(\operatorname{up}(f))\, A\Big), \qquad B \in \{S, \operatorname{diag}\}, \qquad (3.36)$$

$$\operatorname{up}(f \circ g) \to \operatorname{up}(f) \circ \operatorname{up}(g), \qquad (3.37)$$

$$\operatorname{down}(f \circ g, j, l) \to \operatorname{down}(f, j, l) \circ \operatorname{down}(g, j, l), \qquad (3.38)$$

$$\operatorname{up}(h_{b,\,s,s_1,\ldots}) \to h_{b,\,s,0,s_1,\ldots}, \qquad (3.39)$$

$$\operatorname{up}(z_{b,\,s,s_1,\ldots}) \to z_{b,\,s,0,s_1,\ldots}, \qquad (3.40)$$

$$\operatorname{up}(r_{b,\,q,\,s,s_1,\ldots}) \to r_{b,\,q,\,s,0,s_1,\ldots}, \qquad (3.41)$$

$$\operatorname{down}(h_{b,\,s,s_1,\ldots}, j, l) \to h_{b+s_l j,\,s,s_1,\ldots,s_{l-1},s_{l+1},\ldots}, \qquad (3.42)$$

$$\operatorname{down}(z_{b,\,s,s_1,\ldots}, j, l) \to z_{b+s_l j,\,s,s_1,\ldots,s_{l-1},s_{l+1},\ldots}, \qquad (3.43)$$

$$\operatorname{down}(r_{b,\,q,\,s,s_1,\ldots}, j, l) \to r_{b+s_l j,\,q,\,s,s_1,\ldots,s_{l-1},s_{l+1},\ldots}, \qquad (3.44)$$

$$h_{b,\,s,s_1,\ldots} \circ h_{b',\,t,t_1,\ldots} \to h_{b+sb',\,st,\,s_1+st_1,\,s_2+st_2,\ldots}, \quad \text{using } s_i, t_i = 0 \text{ for } i > \operatorname{rank}(\cdot), \qquad (3.45)$$

$$z_{b,\,s,s_1,\ldots} \circ h_{b',\,t,t_1,\ldots} \to z_{b+sb',\,st,\,s_1+st_1,\,s_2+st_2,\ldots}, \quad \text{using } s_i, t_i = 0 \text{ for } i > \operatorname{rank}(\cdot) \qquad (3.46)$$

Table 3.5: Rewrite rules for ranked functions.

Table 3.4 gives the rules for converting SPL to ranked $\sum$-SPL; this table is an index-free equivalent of Table 2.6 in Section 2.4.

As we explained in the previous section, some of the rewrite rules need to be modified to correctly support ranked functions. Table 3.5 summarizes these updated rewrite rules for ranked functions, and also gives additional rules for upranking and downranking the common index mapping functions from Table 3.3.
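As a quick check of these rules, consider the rank-1 function $h_{0,\,1,n}$ from (3.31); applying (3.39) and (3.42) directly gives

$$\operatorname{up}(h_{0,\,1,n}) = h_{0,\,1,0,n}, \qquad \operatorname{down}(h_{0,\,1,n},\, j,\, 1) = h_{nj,\,1},$$

so downranking recovers exactly the per-iteration index mapping $i \mapsto nj + i$ of the loop implied by $I_k \otimes A_n$, while upranking inserts a zero stride for the new (ignored) inner loop index.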

Our library generator does not use plain index-free $\sum$-SPL to generate libraries. Instead, we extend this concept even further by introducing a special loop non-terminal, or loop transform, which allows us to treat every loop as a transform and apply breakdown rules to it. This allows us to introduce degrees of freedom in loop manipulations, and opens up an additional search space for optimization. This is the topic of the next section.


3.7 Advanced Loop Transformations

In this section we explain how the index-free loop representation can be used to implement more complex loop transformations, namely loop interchange, loop distribution, and strip-mining. These transformations are useful as memory hierarchy optimizations, but are also essential for vectorization and parallelization.

These and many other loop transformations, with certain limitations, can be performed by modern general purpose optimizing compilers. However, one of the unsolved challenges in a general purpose compiler is deciding when, and with what parameters, to apply these transformations.

To overcome this problem, we can reuse the feedback-driven search to guide the optimization process. In order to use the Spiral search machinery, we need to define a special non-terminal (called "GT") that represents a single loop or multiple nested loops, and also define breakdown rules for it.

Like the index-free $\sum$-SPL itself, our loop optimization breakdown rules were inspired by the new design of FFTW 3.x [61], which provides special solvers for "vectors" of transforms (i.e., loops over transforms). Our aim is to formalize the essence of these optimizations at the $\sum$-SPL level, and thus enable the generation of FFTW 3.x-like libraries for a wide variety of transforms.

We proceed by explaining the loop non-terminal GT in Section 3.7.1, and then explain how several important loop transformations can be captured using GT breakdown rules. We discuss loop interchange in Section 3.7.2, loop distribution in Section 3.7.3, and strip-mining in Section 3.7.4.

3.7.1 GT: The Loop Non-Terminal

The standard pattern of most loops in $\sum$-SPL is an iterative sum with gather and scatter, so for a single loop we define the rank-1 GT as

$$\operatorname{GT}(A, g, s, \{v\}) = \sum_{v} S(s)\, A\, G(g), \qquad \operatorname{rank}(g) = \operatorname{rank}(s) = 1.$$

We capture the case of multiple nested loops in the same construct using functions of higher rank. The rank-k GT is given by

$$\operatorname{GT}(A, g, s, \{v_1,\ldots,v_k\}) = \sum_{v_k} \cdots \sum_{v_1} S(s)\, A\, G(g), \qquad \operatorname{rank}(g) = \operatorname{rank}(s) = k.$$

Above, $v_1$ corresponds to the number of iterations of the innermost loop, and $v_k$ of the outermost. The ordering of the $v_i$'s corresponds to the ordering of the vector strides in ranked functions.

It is easy to see how familiar SPL constructs map to the GT non-terminal by comparing GT against the index-free $\sum$-SPL notation. We show several examples in Table 3.6.

In the case of the double tensor product, the representation is not unique, because the two loops are interchangeable, and thus the loop order can be chosen arbitrarily. For a rank-k GT, the number of possible loop orderings is $k!$.

Example: Descent tree for DFT32 based on GT. Previously, we showed a descent tree for the DFT32 implementation using standard $\sum$-SPL in Fig. 3.5.


SPL, index-free $\sum$-SPL, and GT representations:

$(I_k \otimes A_n)$: $\ \sum_{k} S(h_{0,\,1,n})\, A_n\, G(h_{0,\,1,n})$ $\ =\ \operatorname{GT}(A,\ h_{0,\,1,n},\ h_{0,\,1,n},\ \{k\})$

$(A_n \otimes I_k)$: $\ \sum_{k} S(h_{0,\,k,1})\, A_n\, G(h_{0,\,k,1})$ $\ =\ \operatorname{GT}(A,\ h_{0,\,k,1},\ h_{0,\,k,1},\ \{k\})$

$(I_m \otimes A_n \otimes I_k)$: $\ \sum_{m}\sum_{k} S(h_{0,\,k,1,nk})\, A_n\, G(h_{0,\,k,1,nk})$ $\ =\ \operatorname{GT}(A,\ h_{0,\,k,1,nk},\ h_{0,\,k,1,nk},\ \{k,m\})$,
or $\ \sum_{k}\sum_{m} S(h_{0,\,k,nk,1})\, A_n\, G(h_{0,\,k,nk,1})$ $\ =\ \operatorname{GT}(A,\ h_{0,\,k,nk,1},\ h_{0,\,k,nk,1},\ \{m,k\})$

Table 3.6: Converting SPL to ranked $\sum$-SPL and GT.

The former descent tree has nodes that are smaller DFTs decorated with extra constructs. Treating loops as non-terminals makes the tree larger, since it adds intermediate nodes.

In particular, the single descent step into DFT32 using the Cooley-Tukey rule (3.1) produces a pair of GT recursion steps (i.e., a pair of loops), instead of decorated DFTs:

$$\operatorname{DFT}_n = (\operatorname{DFT}_k \otimes I_m)\operatorname{diag}(d)(I_k \otimes \operatorname{DFT}_m)L^n_k = \operatorname{GT}(\operatorname{DFT}_k \operatorname{diag}(d \circ h_{0,\,m,1}),\ h_{0,\,m,1},\ h_{0,\,m,1},\ \{m\}) \cdot \operatorname{GT}(\operatorname{DFT}_m,\ h_{0,\,k,1},\ h_{0,\,1,m},\ \{k\}).$$

To further deal with the new non-terminals we need breakdown rules. To obtain the equivalent of the previous descent tree, we define a downrank breakdown rule, which eliminates GT by producing a regular (non-index-free) $\sum$. For a rank-1 GT it is defined as follows:

$$\operatorname{GT}(A, g, s, \{v\}) \xrightarrow{\text{GT-DownRank}} \sum_{j=0}^{v-1} S(\operatorname{down}(s, j, 1))\ \operatorname{down}(A, j, 1)\ G(\operatorname{down}(g, j, 1)). \qquad (3.47)$$

The new descent tree interleaves the Cooley-Tukey (DFT-CTA) and the downrank (GT-DownRank) breakdown rules and is shown in Fig. 3.6.

3.7.2 Loop Interchange

The goal of the loop interchange optimization is to change the order of loops to optimize for memory locality [134, 139].

Table 3.7 shows an example of loop interchange for a trivial program. In the example, interchanging the loops leads to better memory locality.

Original code:

for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 15; ++j)
        y[100*j + i] = x[100*j + i];

After loop interchange:

for (int j = 0; j < 15; ++j)
    for (int i = 0; i < 100; ++i)
        y[100*j + i] = x[100*j + i];

Table 3.7: Loop interchange example.


[Figure: descent tree for DFT32 using GT. DFT-CTA (f = 4) splits DFT32 into GT(DFT8, h_{0,4,1}, h_{0,1,8}, {4}) and GT(DFT4 diag(*), h_{*,8,1}, h_{*,8,1}, {8}); GT-DownRank exposes the loop bodies; DFT-CTA (f = 2) expands DFT8 further into GT(DFT4, h_{0,8,1}, h_{0,1,4}, {2}) and GT(DFT2 diag(*), h_{*,4,1}, h_{*,4,1}, {4}); the remaining nodes terminate as base cases.]

Figure 3.6: Descent tree for DFT32.

Implementation in Spiral. Given a loop nest expressed as a rank-k GT, we introduce a degree of freedom $1 \le f \le k$ in the downrank breakdown rule, which selects the loop to downrank next. The loop that is downranked first becomes the outer loop. The revised breakdown rule is

$$\operatorname{GT}(A, g, s, \{v_1,\ldots,v_k\}) \xrightarrow{\text{GT-DownRank},\ f} \sum_{j=0}^{v_f-1} \operatorname{down}(\operatorname{GT}(A, g, s, \{v_1,\ldots,v_k\}), j, f). \qquad (3.48)$$

The actual downrank operation on the GT is defined as

$$\operatorname{down}(\operatorname{GT}(A, g, s, \{v_1,\ldots,v_k\}), j, f) = \operatorname{GT}(\operatorname{down}(A, j, f),\ \operatorname{down}(g, j, f),\ \operatorname{down}(s, j, f),\ \{v_1,\ldots,v_{f-1},v_{f+1},\ldots,v_k\}).$$

The degrees of freedom within the breakdown rule are exposed to the search engine. For example, for a rank-2 GT, two descent trees are possible, as illustrated in Fig. 3.7.

The final $\sum$-SPL expressions that result from descending differently are almost identical, except that the loops (iterative sums) are reordered, as expected.


[Figure: two descent trees for GT(A, h_{0,k,1,nk}, h_{0,k,1,nk}, {k,m}). (a) Outer loop first (GT-DownRank with f = 2, then f = 1) yields $\sum_{i=0}^{m-1}\sum_{j=0}^{k-1} S(h_{nki+j,\,k})\, A\, G(h_{nki+j,\,k})$. (b) Inner loop first (f = 1 twice) yields $\sum_{j=0}^{k-1}\sum_{i=0}^{m-1} S(h_{nki+j,\,k})\, A\, G(h_{nki+j,\,k})$.]

Figure 3.7: Descent trees corresponding to different downranking orderings of the GT non-terminal associated with $I_m \otimes A \otimes I_k$.

3.7.3 Loop Distribution

Loop distribution splits a single loop into two consecutive loops, each executing a half of the loop body. It can potentially improve cache utilization, by using cache lines more effectively [72].

Table 3.8 shows an example of loop distribution for a trivial program.

Original code:

for (int i = 0; i < 15; ++i) {
    y[i] = x[i];
    y[100+i] = x[100+i];
}

After loop distribution:

for (int i = 0; i < 15; ++i)
    y[i] = x[i];
for (int i = 0; i < 15; ++i)
    y[100+i] = x[100+i];

Table 3.8: Loop distribution example.

The transformation shown in Table 3.8 could speed up the code if the target machine had a cache that could hold a single cache line, consisting of at least 2 data elements. The original code loads x[i], but since the entire cache line is loaded, x[i+1] is loaded also. However, due to the limited capacity, x[i+1] is evicted by loading x[100+i]. The transformed code uses entire cache lines, avoiding the unnecessary eviction.

Mathematical preliminaries. The loop distribution optimization is enabled by the distributivity property of the tensor product, shown below:

$$(I \otimes AB) = (I \otimes A)(I \otimes B).$$

The above can be generalized to allow arbitrary input, output, and intermediate data formats, given by the permutations $P$, $Q$, and $R$:

$$Q(I \otimes AB)P = Q(I \otimes A)R^{-1} \cdot R(I \otimes B)P.$$

The two important special cases are $R = P^{-1}$ and $R = Q$, which mean that one of the stages can be performed inplace:

$$Q(I \otimes AB)P = Q(I \otimes A)P \cdot \underbrace{P^{-1}(I \otimes B)P}_{\text{inplace}},$$

or

$$Q(I \otimes AB)P = \underbrace{Q(I \otimes A)Q^{-1}}_{\text{inplace}} \cdot\ Q(I \otimes B)P.$$

This tensor product property translates to the following GT property:

$$\operatorname{GT}(A \cdot B,\ g, s, \{v\}) = \operatorname{GT}(A,\ f, s, \{v\}) \cdot \operatorname{GT}(B,\ g, f, \{v\}). \qquad (3.49)$$

Above, the function $f$ describes the intermediate data format. The two special cases are $f = g$ (then $\operatorname{GT}(B, g, f, \{v\})$ can be done inplace) and $f = s$ (then $\operatorname{GT}(A, f, s, \{v\})$ can be done inplace). This property holds for GTs of arbitrary rank.

Implementation in Spiral. To be able to use (3.49) as a breakdown rule in Spiral, there must exist descent trees with nodes of the form $\operatorname{GT}(A \cdot B, g, s, \{v\})$. However, even the GT-based descent trees do not contain such nodes; see, for instance, Fig. 3.6.

In order to obtain the desired nodes we add an extra degree of freedom for descending into GT nodes. The original way of descending is to apply a GT breakdown rule, such as GT-DownRank. The extra degree of freedom also allows descending into the child of the GT. In the case of the DFT, it would look as follows:

$$\operatorname{GT}(\operatorname{DFT}, g, s, \{v\}) \xrightarrow{\text{DFT-CT}} \operatorname{GT}\big((\operatorname{DFT}\otimes I)\operatorname{diag}(I \otimes \operatorname{DFT})L,\ g, s, \{v\}\big) = \operatorname{GT}\big(\operatorname{GT}(\operatorname{DFT}\operatorname{diag},\ h, h, \cdot) \cdot \operatorname{GT}(\operatorname{DFT},\ h, h, \cdot),\ g, s, \{v\}\big).$$

Above, after applying the Cooley-Tukey breakdown rule, we can downrank the outer GT. However, this is no different from performing the downrank operation first and applying the Cooley-Tukey breakdown rule afterward. If the two steps are done in reverse order, the descent tree is slightly different, but the result is the same as in Fig. 3.6.

Therefore, the only motivation to apply the breakdown (the Cooley-Tukey rule) to the child of the GT first, instead of to the GT itself, is to obtain the structure suitable for loop distribution. With this in mind, we made the loop distribution rule (3.49) a rewrite rule, applied unconditionally. The decision to perform loop distribution lies in the fact that we descended into the child of the GT, instead of descending into the GT itself. The more flexible option is to make it a breakdown rule, due to the degree of freedom in the choice of $f$.

The final loop distribution breakdown (or rewrite, depending on implementation) rule is stated below:

$$\operatorname{GT}(\operatorname{GT}_A \cdot \operatorname{GT}_B,\ g, s, \{v\}) \xrightarrow{\text{GT-Distr}} \operatorname{GT}(\operatorname{GT}_A,\ f, s, \{v\}) \cdot \operatorname{GT}(\operatorname{GT}_B,\ g, f, \{v\}), \qquad f \in \{s, g\}.$$

To obtain a rewrite rule we must decide a priori on the strategy for choosing $f$.


3.7.4 Strip-Mining

Both parallelization and vectorization of loops rely on a transformation called strip-mining [134, 139]. Strip-mining transforms a loop into a doubly nested loop, where the inner loop performs a "strip" of the iterations of the original loop. Because this transformation is not beneficial by itself, we only use specialized variants tailored for parallelization and vectorization, as explained in Chapter 4. However, for consistency, we explain strip-mining here.

Strip-mining is a degenerate form of loop tiling. Table 3.9 shows an example of strip-mining.

Original code:

for (int i = 0; i < 16; ++i)
    y[i] = x[i];

After strip-mining:

for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
        y[4*i + j] = x[4*i + j];

Table 3.9: Strip-mining example.

Mathematical preliminaries. The underlying tensor product property is

$$(I_{mn} \otimes A_k) = (I_m \otimes I_n \otimes A_k),$$

and with the most general dataflow pattern (given by permutations $P$ and $Q$) it is

$$P(I_{mn} \otimes A_k)Q = P(I_m \otimes I_n \otimes A_k)Q.$$

Converting the above to the GT representation yields

$$\operatorname{GT}(A_k,\ q \circ h_{0,\,1,k},\ p \circ h_{0,\,1,k},\ \{mn\}) = \operatorname{GT}(A_k,\ q \circ h_{0,\,1,k,km},\ p \circ h_{0,\,1,k,km},\ \{m,n\}).$$

Formulating the transformation in the general case, i.e., with general index mapping functions $f$ and $g$, becomes non-trivial, because the functions must be split across the partitioned iteration space. To solve this problem, we introduce a rank splitting operator for functions:

$$\operatorname{split}(h_{0,\,1,k},\ 1, m) = h_{0,\,1,k,km},$$
$$\operatorname{split}(q \circ h_{0,\,1,k},\ 1, m) = q \circ h_{0,\,1,k,km},$$
$$\operatorname{split}\big((j_1, i) \mapsto f(j_1, i),\ 1, m\big) = (j_2, j_1, i) \mapsto f(mj_2 + j_1,\, i).$$

The splitting operator above increases the rank of the function through the implied splitting of the original loop into two nested loops with $m$ inner iterations. The second parameter of split denotes which loop is being split. For rank-1 functions the only possible value is 1.

Splitting a rank-k function $f$ yields a rank-(k+1) function, as shown below:

$$\operatorname{split}(f^{\mathbb{Z}^k\times A\to B},\ t, m) : \mathbb{Z}^{k+1}\times A \to B;\quad (j_k,\ldots,j_{t+1},j'_t,j''_t,j_{t-1},\ldots,j_1,i) \mapsto f(j_k,\ldots,j_{t+1},\, mj'_t + j''_t,\, j_{t-1},\ldots,j_1,i).$$

Using the split operator we can formulate the general strip-mining transformation for GT (which we also call "split") as below:

$$\operatorname{GT}(A, g, s, \{mn\}) \xrightarrow{\text{GT-Split}} \operatorname{GT}\big(\operatorname{split}(A, 1, m),\ \operatorname{split}(g, 1, m),\ \operatorname{split}(s, 1, m),\ \{m, n\}\big) \qquad (3.50)$$

Implementation in Spiral. Strip-mining is only required by vectorization and parallelization; thus we only implement specialized variants, as described in Chapter 4.

3.8 Inplaceness

Mathematical preliminaries. The fundamental inplace loop originates from the tensor product $I_n \otimes A_k$, where $A_k$ is a square $k\times k$ matrix. Suppose we want to perform the inplace computation

$$x \leftarrow (I_n \otimes A_k)\,x, \qquad \begin{bmatrix} x_0 \\ \vdots \\ x_{kn-1} \end{bmatrix} \leftarrow \begin{bmatrix} A_k & & \\ & \ddots & \\ & & A_k \end{bmatrix} \begin{bmatrix} x_0 \\ \vdots \\ x_{kn-1} \end{bmatrix}.$$

The goal is to perform $n$ matrix-vector products of the (square) $A_k$ with different sections of $x$, and locally overwrite the corresponding sections with the results. Thus we can perform the distinct matrix-vector products independently.

To express this property formally, we introduce the inplace tag $\operatorname{Inplace}(\cdot)$, which denotes that the formula must be transformed such that the input and output vectors can be safely aliased. The above observation can now be formally restated in the formula language as

$$\operatorname{Inplace}(I_n \otimes A_k) = \big(I_n \otimes \operatorname{Inplace}(A_k)\big).$$

The above can be generalized to handle an arbitrary data layout given by a permutation $P$, as

$$\operatorname{Inplace}(P^{-1}(I_n \otimes A_k)P) = P^{-1}(I_n \otimes \operatorname{Inplace}(A_k))P. \qquad (3.51)$$

Note that the initial $P$ and the final $P^{-1}$ express the fact that the input and output arrays are permuted equally. Here the tensor product representation already becomes ambiguous and inadequate, since in the right-hand side of (3.51) $P$ is outside the tensor product and does not seem related to any inplaceness property at all.

In the GT domain, the equivalent of (3.51) takes a particularly clean and unambiguous form:

$$\operatorname{Inplace}(\operatorname{GT}(A_k, f, f, \{n\})) = \operatorname{GT}(\operatorname{Inplace}(A_k), f, f, \{n\}). \qquad (3.52)$$

In other words, the data must be read and written using the same index mapping function $f$. If $P = \operatorname{perm}(p)$, then $f = p \circ h_{0,\,1,k}$. We will call (3.52) an inplace GT.

Inplace GT. Due to our simple parametrization algorithm, an inplace GT as in (3.52) is "parametrized away" after parametrization. Consider the following example:

$$\operatorname{GT}(A,\ h^{k\to km}_{0,\,1,k},\ h^{k\to km}_{0,\,1,k},\ \{n\}) \xrightarrow{\text{Par}} \operatorname{GT}(A,\ h^{u_1\to u_2}_{0,\,1,u_3},\ h^{u_1\to u_4}_{0,\,1,u_5},\ \{u_6\}) \qquad (3.53)$$

In the right-hand side of the above, $u_2 \ne u_4$ and $u_3 \ne u_5$, and thus it is no longer an inplace GT.

Page 85: Library Generation For Linear Transforms

68 CHAPTER 3. LIBRARY GENERATION: LIBRARY STRUCTURE

To solve this problem, we could introduce additional constraints on the index mapping functions. However, the root of the problem is the redundant representation of an inplace GT, namely that the index mapping function is stored twice, as both the gather and the scatter function. Thus, a cleaner and more efficient solution is to introduce a special non-redundant notation for an inplace GT.

Below we will write $\vec{v} = \{v_1,\ldots,v_m\}$ in rank-m GTs for brevity. For square $A$, a rank-m inplace GT is represented as

$$\operatorname{GTI}(A, f, \vec{v}) = \sum_{v_m} \cdots \sum_{v_1} S(f)\, A\, G(f), \qquad |\vec{v}| = m.$$

Using GTI, the parametrization that did not work as expected in (3.53) now takes the form

$$\operatorname{GTI}(A,\ h^{k\to kn}_{0,\,1,k},\ \{n\}) \xrightarrow{\text{Par}} \operatorname{GTI}(A,\ h^{u_1\to u_2}_{0,\,1,u_3},\ \{u_4\}).$$

Not only is the inplaceness property preserved in the parametrized form, but the parametrized formula also requires fewer parameters, which means that we can expect less overhead in the implementation (e.g., less register pressure).

Rewriting with GTI. To be able to rewrite complicated $\sum$-SPL expressions with GTI, we need equivalents of the rules in Table 2.7. These equivalents take a somewhat unexpected form, and were first formulated by Franz Franchetti [55].

The origin of the rules is the simple tensor product transformation

$$P(I \otimes A) \to \big(P(I \otimes A)P^{-1}\big)P \to \operatorname{Inplace}\big(P(I \otimes A)P^{-1}\big)\,P. \qquad (3.54)$$

With one-stage formulas there is no clear advantage. However, with two-stage formulas, as in many DFT and other trigonometric transform algorithms, moving $P$ to the right-hand side of the second stage allows it to be fused with the first stage. The most dramatic effect is achieved by applying (3.54) to the prime-factor FFT (2.2). In its original form (2.2) has two out-of-place stages; after applying (3.54), both stages become inplace:

$$\operatorname{DFT}_n \to V^{-1}_{m,k}(\operatorname{DFT}_k \otimes I_m)(I_k \otimes \operatorname{DFT}_m)V_{m,k}$$
$$\to V^{-1}_{m,k}(\operatorname{DFT}_k \otimes I_m)V_{m,k}\,V^{-1}_{m,k}(I_k \otimes \operatorname{DFT}_m)V_{m,k}$$
$$\to \operatorname{Inplace}\big(V^{-1}_{m,k}(\operatorname{DFT}_k \otimes I_m)V_{m,k}\big) \cdot \operatorname{Inplace}\big(V^{-1}_{m,k}(I_k \otimes \operatorname{DFT}_m)V_{m,k}\big).$$

This transformation is extensively used in [121, 122] to manipulate different FFT algorithms.

Converting (3.54) into the GT representation yields

$$\operatorname{perm}(p) \cdot \operatorname{GTI}(A, f, \vec{v}) \to \operatorname{GTI}(A,\ p^{-1} \circ f,\ \vec{v}) \cdot \operatorname{perm}(p). \qquad (3.55)$$

Obviously, there exists a dual rule (obtained by transposing (3.55)):

$$\operatorname{GTI}(A, f, \vec{v}) \cdot \operatorname{perm}(p) \to \operatorname{perm}(p) \cdot \operatorname{GTI}(A,\ p \circ f,\ \vec{v}). \qquad (3.56)$$

Clearly, it is impossible to have both (3.55) and (3.56) as rewrite rules, because rewriting would not terminate. Choosing only one of these rules, on the other hand, breaks symmetry and disallows transposing breakdown rules, a standard mechanism in Spiral. Our solution is to include neither of the rules, but to use the alternative directional rules below:

$$S(p) \cdot \operatorname{GTI}(A, f, \vec{v}) \to \operatorname{GTI}(A,\ p \circ f,\ \vec{v}) \cdot S(p), \qquad (3.57)$$

$$\operatorname{GTI}(A, f, \vec{v}) \cdot G(p) \to G(p) \cdot \operatorname{GTI}(A,\ p \circ f,\ \vec{v}). \qquad (3.58)$$

These rules are somewhat counterintuitive, but they are a generalization of (3.55), and they follow trivially from the fact that $G(p)S(p) = I_n$, $n = \operatorname{domain}(p)$, and that GTI is necessarily a square matrix.

Even though neither scatter nor gather constructs are eliminated by (3.57) and (3.58), they are propagated and can be fused with the subsequent formula stage, as sketched in the following example.

Example 2 (Descending into the recursion step (3.10) with the Cooley-Tukey FFT (3.1) using GTI)

$$S(h_{*,\,1})\,\operatorname{DFT}\,G(h_{*,\,*})$$

⟨apply (3.1)⟩ $\to S(h_{*,\,1}) \cdot (\operatorname{DFT}\otimes I)\operatorname{diag}(\Omega)(I \otimes \operatorname{DFT})L \cdot G(h_{*,\,*})$

⟨convert to GT⟩ $\to S(h_{*,\,1}) \cdot \operatorname{GTI}(\operatorname{DFT}\operatorname{diag}(*),\ h_{0,\,*,1},\ \{*\}) \cdot \operatorname{GT}(\operatorname{DFT},\ h_{0,\,*,1},\ h_{0,\,1,*},\ \{*\}) \cdot G(h_{*,\,*})$

⟨apply (3.57)⟩ $\to \operatorname{GTI}(\operatorname{DFT}\operatorname{diag}(*),\ h_{*,\,*,1},\ \{*\}) \cdot S(h_{*,\,1}) \cdot \operatorname{GT}(\operatorname{DFT},\ h_{0,\,*,1},\ h_{0,\,1,*},\ \{*\}) \cdot G(h_{*,\,*})$

⟨apply (3.35)–(3.36)⟩ $\to \operatorname{GTI}(\operatorname{DFT}\operatorname{diag}(*),\ h_{*,\,*,1},\ \{*\}) \cdot \operatorname{GT}(\operatorname{DFT},\ h_{*,\,*,*},\ h_{*,\,1,*},\ \{*\})$

The above example also demonstrates that the recursive implementation of (3.10) can be done with no intermediate storage, if non-aliased input and output vectors $x$ and $y$ are given. The right GT works from $x$ into $y$, and the GTI works inplace in $y$.

In FFTW, the above is the rationale for forcing the recursion step (3.11) to always be a base case, and thus allowing only right-expanded descent trees.

3.9 Examples of Complicated Closures

Tables 3.10–3.11 show examples of closures generated for three different transforms (DFT, RDFT, and DCT-4). Table 3.10 shows the closures using standard $\sum$-SPL notation, and Table 3.11 shows the same closures using the GT notation, which also explicitly shows how inplaceness is exploited (via GTI) in the case of the DFT. The closures were generated using the following breakdown rules:

• DFT – Cooley-Tukey FFT (2.1);

• RDFT – real Cooley-Tukey FFT equivalent (2.6), and (2.17) for the auxiliary transform;

• DCT-4 – "Cooley-Tukey"-type algorithm (2.12), and (2.17), (2.18) for the auxiliary transforms.

In Fig. 3.8 we show the call graphs that correspond to these closures. Note that we have restricted the recursion (and thus the closure generation) to a subset of all possibilities. For example, the Cooley-Tukey FFT (used in Fig. 3.8(a)) reexpresses the DFT as two smaller DFTs. We restrict the further application of the Cooley-Tukey FFT to one of these smaller DFTs. The same is done


[Figure: three call graphs, one per generated library: (a) DFT with nodes RS 1–7, (b) RDFT with nodes RS 1–10, (c) DCT-4 with nodes RS 1–8.]

Figure 3.8: Call graphs for the generated libraries with looped recursion steps (corresponding to Table 3.11).

for RDFT and DCT-4. This means that some nodes in the call graphs are "dead ends", i.e., they have no outgoing edges. Since recursion is not possible for these nodes, they must be implemented as base cases.

Example of parameter equality constraints. For the more complicated closures, the linear system of equations that arises from the parameter equality constraints during parametrization is no longer completely trivial, but still very simple. For example, for recursion step 3 in the RDFT closure the parameter equality constraints were

$$2u_8 = 2u_1, \qquad u_2 = u_3, \qquad 2u_1 = u_2.$$

There is only one connected component. There are four unknowns and three equations. Fixing $u_1$ as a constant and solving for all other parameters gives

$$u_2 = 2u_1, \qquad u_3 = 2u_1, \qquad u_8 = u_1.$$

After these constraints were incorporated into the parametrized recursion step, we got the results shown in Tables 3.10 and 3.11. The equations above also explain how the parametrized recursion steps can contain expressions such as $2u_1$, rather than plain parameters $u_i$.

Redundancy in the closure. The automatically obtained closures are not minimal. For example, consider the DFT closure in Fig. 3.8(a). The nodes RS 4–7 form a so-called subclosure. A subclosure is any smaller closure contained in a larger one. Any recursion step in the subclosure


DFT

1: $\operatorname{DFT}_{u_1}$
2: $\sum_{u_7} S(h^{u_1\to u_5}_{0,\,u_6,1})\,\operatorname{DFT}_{u_1}\operatorname{diag}(\operatorname{pre}(u_3^{\mathbb{Z}\times u_1\to\mathbb{C}}))\,G(h^{u_1\to u_5}_{0,\,u_6,1})$
3: $\sum_{u_8} S(h^{u_1\to u_6}_{0,\,1,u_7})\,\operatorname{DFT}_{u_1}\,G(h^{u_1\to u_3}_{0,\,u_4,1})$
4: $S(h^{u_1\to u_2}_{u_3,\,1})\,\operatorname{DFT}_{u_1}\,G(h^{u_1\to u_6}_{u_7,\,u_8})$
5: $S(h^{u_1\to u_5}_{u_6,\,u_7})\,\operatorname{DFT}_{u_1}\operatorname{diag}(\operatorname{pre}(u_3^{u_1\to\mathbb{C}}))\,G(h^{u_1\to u_5}_{u_6,\,u_7})$
6: $\sum_{u_8} S(h^{u_1\to u_5}_{u_6,\,u_7,1})\,\operatorname{DFT}_{u_1}\operatorname{diag}(\operatorname{pre}(u_3^{\mathbb{Z}\times u_1\to\mathbb{C}}))\,G(h^{u_1\to u_5}_{u_6,\,u_7,1})$
7: $\sum_{u_{11}} S(h^{u_1\to u_8}_{u_9,\,1,u_{10}})\,\operatorname{DFT}_{u_1}\,G(h^{u_1\to u_3}_{u_4,\,u_5,u_6})$

RDFT

1: $\operatorname{RDFT}_{u_1}$
2: $S\big((h^{u_1\to u_2}_{0,\,u_3})\otimes \imath_2\big)\,\operatorname{RDFT}_{2u_1}\,G(h^{2u_1\to u_6}_{0,\,1})$
3: $\sum_{u_{11}} S\big((r^{u_1\to u_7}_{1,\,u_9,1,u_{10}})\otimes \imath_2\big)\,\operatorname{diag}(C_{u_1})\,\operatorname{rDFT}_{2u_1}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\times\mathbb{Z}\to\mathbb{R}}))\,G(h^{2u_1\to u_4}_{u_5,\,1,u_6})$
4: $\sum_{u_8} S(h^{u_1\to u_6}_{0,\,u_7,1})\,\operatorname{URDFT}_{u_1}\,G(h^{u_1\to u_3}_{0,\,u_4,1})$
5: $S\big((r^{u_2\to u_1}_{u_3,\,u_4,u_5})\otimes \imath_2\big)\,\operatorname{diag}(C_{u_2})\,\operatorname{rDFT}_{2u_2}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\to\mathbb{R}}))\,G(h^{2u_2\to u_9}_{u_{10},\,1})$
6: $S(h^{u_1\to u_2}_{u_3,\,u_4})\,\operatorname{URDFT}_{u_1}\,G(h^{u_1\to u_7}_{u_8,\,u_9})$
7: $S\big(h^{2u_6\to u_2}_{u_3,\,u_4}\circ(h^{u_5\to u_6}_{0,\,u_7}\otimes \imath_2)\big)\,\operatorname{URDFT}_{2u_5}\,G(h^{2u_5\to u_{10}}_{0,\,1})$
8: $\sum_{u_{15}} S\big(h^{2u_{11}\to u_8}_{u_9,\,u_{10}}\circ(r^{u_1\to u_{11}}_{1,\,u_{13},1,u_{14}}\otimes \imath_2)\big)\,\operatorname{diag}(C_{u_1})\,\operatorname{rDFT}_{2u_1}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\times\mathbb{Z}\to\mathbb{R}}))\,G(h^{2u_1\to u_4}_{u_5,\,1,u_6})$
9: $\sum_{u_{10}} S(h^{u_1\to u_8}_{0,\,u_9,1})\,\operatorname{URDFT}_{u_1}\,G(h^{u_1\to u_3}_{u_4,\,u_5,u_6})$
10: $S\big(h^{2u_5\to u_2}_{u_3,\,u_4}\circ(r^{u_6\to u_5}_{u_7,\,u_8,u_9}\otimes \imath_2)\big)\,\operatorname{diag}(C_{u_6})\,\operatorname{rDFT}_{2u_6}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\to\mathbb{R}}))\,G(h^{2u_6\to u_{13}}_{u_{14},\,1})$

DCT-4

1: $\operatorname{DCT\text{-}4}_{u_1}$
2: $\sum_{u_{13}} S(r^{2u_8\to u_9}_{0,\,u_{11},1,u_{12}})\,\operatorname{diag}(N_{2u_8})\,\operatorname{RDFT\text{-}3}^{\top}_{2u_8}\operatorname{rcdiag}(\operatorname{pre}(u_4^{\mathbb{Z}\times 2u_8\to\mathbb{R}}))\,G(h^{2u_8\to u_6}_{0,\,1,u_7}\circ \ell^{2u_8}_{u_8})$
3: $\sum_{u_{10}} S(h^{u_1\to u_8}_{0,\,u_9,1})\,\operatorname{RDFT\text{-}3}_{u_1}\operatorname{diag}(N_{u_1})\,G(r^{u_1\to u_3}_{0,\,u_5,1,u_6})$
4: $S(h^{u_1\to u_2}_{u_3,\,u_4})\,\operatorname{RDFT\text{-}3}_{u_1}\operatorname{diag}(N_{u_1})\,G(r^{u_1\to u_7}_{u_9,\,u_{10},u_{11}})$
5: $S(r^{2u_{13}\to u_1}_{u_3,\,u_4,u_5})\,\operatorname{diag}(N_{2u_{13}})\,\operatorname{RDFT\text{-}3}^{\top}_{2u_{13}}\operatorname{rcdiag}(\operatorname{pre}(u_9^{2u_{13}\to\mathbb{R}}))\,G(h^{2u_{13}\to u_{11}}_{u_{12},\,1}\circ \ell^{2u_{13}}_{u_{13}})$
6: $\sum_{u_{14}} S\big(h^{2u_{10}\to u_7}_{u_8,\,u_9}\circ(r^{u_1\to u_{10}}_{0,\,u_{12},1,u_{13}}\otimes \imath_2)\big)\,\operatorname{diag}(C_{u_1})\,\operatorname{rDFT}_{2u_1}(1/4)\,G(h^{2u_1\to u_4}_{0,\,1,u_5})$
7: $\sum_{u_{12}} S(h^{u_1\to u_{10}}_{0,\,u_{11},1})\,\operatorname{RDFT\text{-}3}_{u_1}\operatorname{diag}(N_{u_1})\,G(r^{u_1\to u_3}_{u_5,\,u_6,u_7,u_8})$
8: $S\big(h^{2u_5\to u_2}_{u_3,\,u_4}\circ(r^{u_6\to u_5}_{u_7,\,u_8,u_9}\otimes \imath_2)\big)\,\operatorname{diag}(C_{u_6})\,\operatorname{rDFT}_{2u_6}(1/4)\,G(h^{2u_6\to u_{13}}_{u_{14},\,1})$

Table 3.10: Generated recursion step closures for DFT, RDFT, and DCT-4 with looped recursion steps in index-free $\sum$-SPL.


DFT

1: $\operatorname{DFT}_{u_1}$
2: $\operatorname{GTI}\big(\operatorname{DFT}_{u_1}\operatorname{diag}(\operatorname{pre}(u_3^{\mathbb{Z}\times u_1\to\mathbb{C}})),\ h^{u_1\to u_5}_{0,\,u_6,1},\ \{u_7\}\big)$
3: $\operatorname{GT}\big(\operatorname{DFT}_{u_1},\ h^{u_1\to u_3}_{0,\,u_4,1},\ h^{u_1\to u_6}_{0,\,1,u_7},\ \{u_8\}\big)$
4: $S(h^{u_1\to u_2}_{u_3,\,1})\,\operatorname{DFT}_{u_1}\,G(h^{u_1\to u_6}_{u_7,\,u_8})$
5: $\operatorname{GTI}\big(\operatorname{DFT}_{u_1}\operatorname{diag}(\operatorname{pre}(u_3^{u_1\to\mathbb{C}})),\ h^{u_1\to u_5}_{u_6,\,u_7},\ \{\}\big)$
6: $\operatorname{GTI}\big(\operatorname{DFT}_{u_1}\operatorname{diag}(\operatorname{pre}(u_3^{\mathbb{Z}\times u_1\to\mathbb{C}})),\ h^{u_1\to u_5}_{u_6,\,u_7,1},\ \{u_8\}\big)$
7: $\operatorname{GT}\big(\operatorname{DFT}_{u_1},\ h^{u_1\to u_3}_{u_4,\,u_5,u_6},\ h^{u_1\to u_8}_{u_9,\,1,u_{10}},\ \{u_{11}\}\big)$

RDFT

1: $\operatorname{RDFT}_{u_1}$
2: $S\big((h^{u_1\to u_2}_{0,\,u_3})\otimes \imath_2\big)\,\operatorname{RDFT}_{2u_1}\,G(h^{2u_1\to u_6}_{0,\,1})$
3: $\operatorname{GT}\big(\operatorname{diag}(C_{u_1})\operatorname{rDFT}_{2u_1}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\times\mathbb{Z}\to\mathbb{R}})),\ h^{2u_1\to u_4}_{u_5,\,1,u_6},\ (r^{u_1\to u_7}_{1,\,u_9,1,u_{10}}\otimes \imath_2),\ \{u_{11}\}\big)$
4: $\operatorname{GT}\big(\operatorname{URDFT}_{u_1},\ h^{u_1\to u_3}_{0,\,u_4,1},\ h^{u_1\to u_6}_{0,\,u_7,1},\ \{u_8\}\big)$
5: $S\big((r^{u_2\to u_1}_{u_3,\,u_4,u_5})\otimes \imath_2\big)\,\operatorname{diag}(C_{u_2})\,\operatorname{rDFT}_{2u_2}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\to\mathbb{R}}))\,G(h^{2u_2\to u_9}_{u_{10},\,1})$
6: $S(h^{u_1\to u_2}_{u_3,\,u_4})\,\operatorname{URDFT}_{u_1}\,G(h^{u_1\to u_7}_{u_8,\,u_9})$
7: $S\big(h^{2u_6\to u_2}_{u_3,\,u_4}\circ(h^{u_5\to u_6}_{0,\,u_7}\otimes \imath_2)\big)\,\operatorname{URDFT}_{2u_5}\,G(h^{2u_5\to u_{10}}_{0,\,1})$
8: $\operatorname{GT}\big(\operatorname{diag}(C_{u_1})\operatorname{rDFT}_{2u_1}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\times\mathbb{Z}\to\mathbb{R}})),\ h^{2u_1\to u_4}_{u_5,\,1,u_6},\ h^{2u_{11}\to u_8}_{u_9,\,u_{10}}\circ(r^{u_1\to u_{11}}_{1,\,u_{13},1,u_{14}}\otimes \imath_2),\ \{u_{15}\}\big)$
9: $\operatorname{GT}\big(\operatorname{URDFT}_{u_1},\ h^{u_1\to u_3}_{u_4,\,u_5,u_6},\ h^{u_1\to u_8}_{0,\,u_9,1},\ \{u_{10}\}\big)$
10: $S\big(h^{2u_5\to u_2}_{u_3,\,u_4}\circ(r^{u_6\to u_5}_{u_7,\,u_8,u_9}\otimes \imath_2)\big)\,\operatorname{diag}(C_{u_6})\,\operatorname{rDFT}_{2u_6}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\to\mathbb{R}}))\,G(h^{2u_6\to u_{13}}_{u_{14},\,1})$

DCT-4

1: $\operatorname{DCT\text{-}4}_{u_1}$
2: $\operatorname{GT}\big(\operatorname{diag}(N_{2u_8})\operatorname{RDFT\text{-}3}^{\top}_{2u_8}\operatorname{rcdiag}(\operatorname{pre}(u_4^{\mathbb{Z}\times 2u_8\to\mathbb{R}})),\ h^{2u_8\to u_6}_{0,\,1,u_7}\circ \ell^{2u_8}_{u_8},\ r^{2u_8\to u_9}_{0,\,u_{11},1,u_{12}},\ \{u_{13}\}\big)$
3: $\operatorname{GT}\big(\operatorname{RDFT\text{-}3}_{u_1}\operatorname{diag}(N_{u_1}),\ r^{u_1\to u_3}_{0,\,u_5,1,u_6},\ h^{u_1\to u_8}_{0,\,u_9,1},\ \{u_{10}\}\big)$
4: $S(h^{u_1\to u_2}_{u_3,\,u_4})\,\operatorname{RDFT\text{-}3}_{u_1}\operatorname{diag}(N_{u_1})\,G(r^{u_1\to u_7}_{u_9,\,u_{10},u_{11}})$
5: $S(r^{2u_{13}\to u_1}_{u_3,\,u_4,u_5})\,\operatorname{diag}(N_{2u_{13}})\,\operatorname{RDFT\text{-}3}^{\top}_{2u_{13}}\operatorname{rcdiag}(\operatorname{pre}(u_9^{2u_{13}\to\mathbb{R}}))\,G(h^{2u_{13}\to u_{11}}_{u_{12},\,1}\circ \ell^{2u_{13}}_{u_{13}})$
6: $\operatorname{GT}\big(\operatorname{diag}(C_{u_1})\operatorname{rDFT}_{2u_1}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\times\mathbb{Z}\to\mathbb{R}})),\ h^{2u_1\to u_4}_{0,\,1,u_5},\ h^{2u_{10}\to u_7}_{u_8,\,u_9}\circ(r^{u_1\to u_{10}}_{0,\,u_{12},1,u_{13}}\otimes \imath_2),\ \{u_{14}\}\big)$
7: $\operatorname{GT}\big(\operatorname{RDFT\text{-}3}_{u_1}\operatorname{diag}(N_{u_1}),\ r^{u_1\to u_3}_{u_5,\,u_6,u_7,u_8},\ h^{u_1\to u_{10}}_{0,\,u_{11},1},\ \{u_{12}\}\big)$
8: $S\big(h^{2u_5\to u_2}_{u_3,\,u_4}\circ(r^{u_6\to u_5}_{u_7,\,u_8,u_9}\otimes \imath_2)\big)\,\operatorname{diag}(C_{u_6})\,\operatorname{rDFT}_{2u_6}(\lambda\text{-wrap}(\lambda_1^{\mathbb{Z}\to\mathbb{R}}))\,G(h^{2u_6\to u_{13}}_{u_{14},\,1})$

Table 3.11: Generated recursion step closures for DFT, RDFT, and DCT-4 (of Table 3.10) with looped recursion steps using GT.

can be implemented using only other recursion steps in the same subclosure. On the call graph, the subclosure is a subgraph which has no outgoing edges that connect to nodes outside of the subgraph. Indeed, starting from RS 4 we can only reach nodes RS 4–7.

It is very often the case that one of the elements of the subclosure can be used to implement the root of the original closure. If we look up the definition of RS 4 in Table 3.10, we find it to be


a DFT with a strided scatter and gather. Indeed, RS 1 can be easily implemented using RS 4 by setting $u_3 = u_7 = 0$ (gather/scatter bases) and $u_8 = 1$ (gather stride) in RS 4.

The same kind of closure reduction can be done for the RDFT. Starting from RS 6, the only reachable nodes are RS 6–10. Thus, we have a smaller subclosure with only 5 recursion steps (instead of 10). Strictly speaking, RS 6 is not a more general version of RS 1 (see Table 3.10), but it can still be used to implement RS 1, using the close relationship between RDFT and URDFT.

To summarize the above method, the size of the closure can be reduced by 1) finding the smallest closed subgraph in the original call graph (i.e., a subclosure); 2) determining how to implement RS 1 using one of the nodes in the subclosure; 3) using the subclosure instead of the original closure. In the case of the DFT, this procedure is equivalent to eliminating the more specialized variants of recursion steps in favor of less specialized ones. In the case of the RDFT, a conversion between RDFT and URDFT is necessary to apply the method. In the case of the DCT-4, this method does not work, since no other recursion step contains the DCT-4 or a related transform.

An alternative way to reduce the closure size is by fusing pairs of recursion steps. A pair of recursion steps can be fused into a single one if either of the recursion steps can be used to implement the other. For example, RS 1 and RS 4 in the DFT closure can be fused, because RS 4 can implement RS 1, as we already explained. After fusing RS 1 and RS 4, we obtain a new closure consisting of RS 4–7 only, which is the same as using the subclosure method.

However, fusion is more generally applicable, because it does not require the existence of a subclosure. It can thus be applied to the DCT-4, where the subclosure method fails: for example, RS 3 and RS 7 of the DCT-4 closure can be fused.

Eliminating the specialized variants to reduce the closure size using either of these methods can lead to performance degradation. For example, we see this for small transform sizes in FFTW, as explained in Section 6.2.4 (Chapter 6, see Fig. 6.7).

To date, we have not implemented either of these reduction schemes; however, we believe that both methods can be successfully automated. If code size is important, the closure reduction methods become a crucial tool for reducing it.


Chapter 4

Library Generation: Parallelism

To obtain the highest possible performance on current computers, the programmer must exploit two levels of parallelism: vector parallelism and thread parallelism. Fig. 4.1 graphically shows the implementation space spanned by both types of parallelism. Our goal is to be able to span this entire space with our generator.

[Figure 4.1: The space of implementations including both vector and thread parallelism; the axes are vector length and number of threads.]

For example, a common computer available today might have 4 processor cores, and support 4-way SIMD parallelism. In this case the maximum possible speedup from correctly exploiting parallelism is 16x.

For typical numeric code, exploiting vector parallelism involves focusing on inner loops and parallelism within operations. Exploiting thread parallelism means focusing on the outer loops.

With the $\sum$-SPL framework, however, we have a unique situation: using loop interchange and loop distribution, outer loops can become inner loops, and thus we are very flexible. In fact, the same loop can be used for both vector and thread parallelism.

In the following two sections we explain how we enable both kinds of parallelism in the library generator. We follow the same outline. First, we explain how this can be done at the SPL level, perhaps with some restrictions, and then explain the more general method at the $\sum$-SPL level. In the case of vectorization, we have not yet translated all SPL vectorization techniques to $\sum$-SPL, and only a subset is available, which results in suboptimal, but very acceptable, performance.


4.1 Vectorization

4.1.1 Background

Many current general purpose and DSP architectures provide short vector SIMD (single instruction, multiple data) instructions, which operate in parallel on the elements of a vector register, typically of length 2 to 16. The introduction of SIMD vector instructions is a relatively straightforward way to add parallelism to an existing CPU data-path, and as clock frequency scaling is becoming less viable, SIMD extensions are becoming ubiquitous.

For instance, Intel and AMD defined over the years MMX, SSE, SSE2, SSE3, SSSE3, SSE4, SSE5, 3DNow!, Extended 3DNow!, and 3DNow! Professional as x86 SIMD vector extensions. Modern PowerPC architectures provide the AltiVec SIMD instruction set, and the Cell SPU and BlueGene/L have their own custom vector extensions. Additional extensions are defined by the PA-RISC (MAX), MIPS (MIPS-3D), and ARM (Wireless MMX) processor architectures, and also by many VLIW DSP processors.

Although exploiting SIMD instructions does not require a new programming model (unlike threads), it requires either programming in assembly or using the low-level compiler intrinsics available in C and C++. Another alternative is compiler vectorization, although it remains rather limited, as manifested by the fact that most high-performance libraries rely on hand-vectorized code.

Previous work. There is a large body of work on vectorizing compilers for the vector supercomputers of the 1980s; see, for example, [6, 139]. With the advent of SIMD in mainstream platforms, the topic has received renewed attention. Current SIMD architectures have shorter vectors, strict data alignment requirements, and other differences.

The most commonly used technique for vectorizing programs for modern SIMD platforms is vectorization of the inner loops; see, for example, [75, 90, 91]. Vectorization in the popular GNU C Compiler (GCC) is described in [85]. The Intel C/C++ Compiler also provides a vectorizer, described in [25]. An excellent overview of the difficulties of vectorization for several relevant multimedia programs is given in [104].

Another technique for vectorization, called superword-level parallelism (SLP), was proposed by Larsen et al. in [76, 77, 107] and independently, in the context of the DFT, by Kral et al. in [49, 56, 74]. SLP attempts to group isomorphic statements in order to obtain short vectors. Yet another form of SLP, which exploits the 2-way parallelism within complex number arithmetic, was proposed in FFTW 3.x [61].

Interestingly, SLP can be combined with traditional vectorization to obtain longer vectors, as also shown in [61]. Similarly, one can vectorize several loops at once to produce longer vectors [136].

Modern SIMD processors impose strict alignment constraints. An approach to vectorization with unaligned data is discussed in [46, 135]. Vectorizing with disjoint data sets is discussed in [86].

Data alignment handling and vectorization of disjoint data are useful optimizations which are orthogonal to general vectorization. We expect these methods to also be expressible at the SPL and/or $\sum$-SPL level and to be potentially beneficial to the performance of our generated code, since we did not fully address these issues.

The scope of our work. In this thesis we propose a form of domain-specific vectorization, which vectorizes the loops implied by the SPL or $\sum$-SPL constructs. Both inner and outer loops can be vectorized, and unless there exists a transform-specific vectorization rule, the outer loops are vectorized to minimize the vector shuffle overhead.


We worked with Intel’s SSE, SSE2 and SSE3 extensions. These extensions are available onmost current Intel and AMD processors. Collectively, they define six 128-bit modes: 4-way single-precision and 2-way double-precision floating point vectors; and 2-way 64-bit, 4-way 32-bit, 8-way16-bit, and 16-way 8-bit integer vectors. We denote the vector length of a SIMD extension modewith ν.

We only consider the 2- and 4-way floating-point modes, since these are by far the most widely used and the most relevant for linear transforms. Our approach is not limited to ν = 4, and in our earlier work [54] we already showed results with 8-way and 16-way vectors.

For the code examples with vector instructions we will use the Intel C compiler's intrinsics interface (also supported by the GNU C compiler). The interface includes vector data types that correspond to vector registers or aligned memory locations, and vector operations that correspond to SSE instructions.

As an example, below we show a short instruction sequence using the two SSE intrinsic operations _mm_add_ps and _mm_sub_ps, which correspond to vector addition and subtraction:

/* Intel C SSE intrinsics, 4-way 32-bit floating-point data type */
__m128 x[2], y[2];
y[0] = _mm_add_ps(x[0], x[1]);
y[1] = _mm_sub_ps(x[0], x[1]);

In the rest of this section we first explain at a high level how vectorization is performed at the SPL level of abstraction, using a combination of specialized breakdown and rewrite rules. This is the result of our previous work described in [53]. Next, we explain how the vectorization can be adapted for generating libraries. In order to work with libraries, the SPL vectorization rules had to be "ported" to the $\sum$-SPL level.

4.1.2 Vectorization by Rewriting SPL Formulas.

In Spiral, vectorized and parallelized implementations are generated by first obtaining "fully optimized" SPL formulas, which means they can be mapped to efficient vector or parallel code. This is possible because certain SPL constructs express parallelism. For example, $I \otimes A$ corresponds to a parallelizable loop with no loop-carried dependencies.

In this section, our goal is to take formulas obtained by the recursive application of breakdown rules like (2.1) and automatically manipulate them into a form that enables a direct mapping into SIMD vector code. Further, we also want to explore different vectorizations of the same formula. The solution is a suitably designed rewriting system that implements our previous ideas for formula-based vectorization in [50, 51].

To produce SIMD vector code, Spiral needs the following four components:

• Vectorization tags, which introduce the vector length and mark a formula as "to be vectorized".

• Vector formula constructs, which denote subformulas that can be perfectly mapped to SIMD code.

• Rewriting rules, which transform general formulas into vector formulas.

• A vector backend, which emits SIMD code from vector formulas.


Vectorization tags. We introduce a set of tags to propagate vectorization information through the formulas and to perform algebraic simplification of permutations. Note that all objects remain matrices.

We tag a formula construct $A$ to be translated into vector code for vector length ν by

$$\operatorname{Vec}_\nu(A).$$

Vector formula constructs. The central formula construct that can be implemented on all ν-way short vector extensions is

$$A \otimes I_\nu, \qquad (4.1)$$

where $A$ is an arbitrary real matrix. Vector code is obtained by generating scalar code for $A$ (i.e., for $x \mapsto Ax$) and replacing all scalar operations by their respective ν-way vector operations. For example, c=a+b is replaced by c=vadd(a,b).

We write

$$A\,\bar\otimes\, I_\nu \qquad (4.2)$$

(with the bar marking the construct as final) to stipulate that the tensor product is to be mapped into vector code as explained above.
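For instance (our own illustration of the replacement rule above, with $A = \operatorname{DFT}_2$ and ν = 4), the scalar code for $A$ becomes vector code verbatim, computing four independent $\operatorname{DFT}_2$'s at once; this mirrors the intrinsics example shown earlier:

#include <xmmintrin.h>

/* Scalar code for A = DFT_2 ... */
void dft2_scalar(const float* x, float* y) {
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

/* ... and its 4-way vectorized version: each scalar op is replaced
   by the corresponding vector op, realizing DFT_2 (x) I_4. */
void dft2_vec4(const __m128* x, __m128* y) {
    y[0] = _mm_add_ps(x[0], x[1]);
    y[1] = _mm_sub_ps(x[0], x[1]);
}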

Of course, most formulas do not match (4.2). In these cases we manipulate the formula using rewriting rules until it consists of components that either match (4.2) or are among a small set of base cases. It turns out that for a large class of formulas the only base cases needed are

$$\underbrace{L^{2\nu}_{2}}_{\text{base}}, \quad \underbrace{L^{2\nu}_{\nu}}_{\text{base}}, \quad \underbrace{L^{\nu^2}_{\nu}}_{\text{base}}, \quad \underbrace{D_\nu}_{\text{base}} \qquad (4.3)$$

where $D_\nu$ is a real $\nu\times\nu$ diagonal matrix. Above, we used the tag "base" to mark the base cases.

Constructs marked with $\bar\otimes$ and "base" are final, i.e., they will not be changed by rewriting rules.

For vectorizing complex formulas, additional base cases are needed; we do not show them here, but refer the reader to [53] for details.
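As an aside on why the base cases in (4.3) are implementable: for ν = 4 the stride permutation $L^{\nu^2}_{\nu}$ acting on four vector registers is exactly a 4×4 in-register matrix transpose, which SSE supports directly. A minimal sketch (our illustration, not necessarily how the generator emits it) uses the standard _MM_TRANSPOSE4_PS macro from xmmintrin.h:

#include <xmmintrin.h>

/* L^{16}_4 on four 4-way registers = 4x4 matrix transpose in registers. */
void stride_perm_L16_4(__m128 r[4]) {
    _MM_TRANSPOSE4_PS(r[0], r[1], r[2], r[3]);
}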

Definition 1. We call a formula ν-way vectorized if it is either of the form (4.2), or one of the forms in (4.3), or of the form

$$I_m \otimes A \quad\text{or}\quad AB, \qquad (4.4)$$

where $A$ and $B$ are ν-way vectorized.

Rewriting rules. In our framework the rewriting starts from the tagged formula $\operatorname{Vec}_\nu(A)$ and proceeds by applying rules that vectorize the formula in the sense of Definition 1. Most of our rewriting rules are shown in Tables 4.1–4.2.

As an example of applying the rewriting rules, we explain how $I_m \otimes A$ is handled by our rewriting system. We start with the tagged formula

$$\operatorname{Vec}_\nu(I_m \otimes A),$$

which means "$I_m \otimes A$ is to be vectorized." The system can only apply one of the alternatives of rule (4.9). Suppose it chooses the first alternative, which yields

$$I_{m/\nu} \otimes \big(\operatorname{Vec}_\nu(I_\nu \otimes A)\big)$$


$$\operatorname{Vec}_\nu(L^{n\nu}_{n}) \to \Big(I_{n/\nu} \otimes \underbrace{L^{\nu^2}_{\nu}}_{\text{base}}\Big)\big(L^{n}_{n/\nu}\,\bar\otimes\, I_\nu\big) \qquad (4.5)$$

$$\operatorname{Vec}_\nu(L^{n\nu}_{\nu}) \to \big(L^{n}_{\nu}\,\bar\otimes\, I_\nu\big)\Big(I_{n/\nu} \otimes \underbrace{L^{\nu^2}_{\nu}}_{\text{base}}\Big) \qquad (4.6)$$

$$\operatorname{Vec}_\nu(L^{mn}_{m}) \to \big(L^{mn/\nu}_{m}\,\bar\otimes\, I_\nu\big)\Big(I_{mn/\nu^2} \otimes \underbrace{L^{\nu^2}_{\nu}}_{\text{base}}\Big)\Big(\big(I_{n/\nu} \otimes L^{m}_{m/\nu}\big)\,\bar\otimes\, I_\nu\Big) \qquad (4.7)$$

Table 4.1: SPL vectorization rules for the stride permutation.

$$\operatorname{Vec}_\nu(A \otimes I_m) \to \big(A \otimes I_{m/\nu}\big)\,\bar\otimes\, I_\nu \qquad (4.8)$$

$$\operatorname{Vec}_\nu(I_m \otimes A) \to \begin{cases} I_{m/\nu} \otimes \operatorname{Vec}_\nu(I_\nu \otimes A) \\[2pt] \operatorname{Vec}_\nu(L^{mn}_m)\,\operatorname{Vec}_\nu(A \otimes I_m)\,\operatorname{Vec}_\nu(L^{mn}_n) \end{cases} \qquad (4.9)$$

$$\operatorname{Vec}_\nu\big((I_m \otimes A)L^{mn}_m\big) \to \begin{cases} \operatorname{Vec}_\nu(L^{mn}_m)\,\big(\operatorname{Vec}_\nu(A \otimes I_m)\big) \\[2pt] \big(I_{m/\nu} \otimes \operatorname{Vec}_\nu(L^{n\nu}_\nu)(A\,\bar\otimes\, I_\nu)\big)\big(L^{mn/\nu}_{m/\nu}\,\bar\otimes\, I_\nu\big) \end{cases} \qquad (4.10)$$

$$\operatorname{Vec}_\nu\Big(\big(I_k \otimes (I_m \otimes A_{n\times n})L^{mn}_m\big)L^{kmn}_k\Big) \to \operatorname{Vec}_\nu\big(L^{km}_k \otimes I_n\big)\Big(I_m \otimes \operatorname{Vec}_\nu\big((I_k \otimes A_{n\times n})L^{kn}_k\big)\Big)\operatorname{Vec}_\nu\big(L^{mn}_m \otimes I_k\big) \qquad (4.11)$$

Table 4.2: SPL vectorization rules for tensor products. $A$ is an $n\times n$ matrix.

and then applies the second alternative of (4.9) to $(I_\nu \otimes A)$, which leads to

$$I_{m/\nu} \otimes \big(\operatorname{Vec}_\nu(L^{n\nu}_\nu)\,(A\,\bar\otimes\, I_\nu)\,\operatorname{Vec}_\nu(L^{n\nu}_n)\big).$$

Next, only rules (4.5) and (4.6) match, which yields

$$I_{m/\nu} \otimes \Big( \big(L^{n}_{\nu}\,\bar\otimes\, I_\nu\big)\Big(I_{n/\nu} \otimes \underbrace{L^{\nu^2}_{\nu}}_{\text{base}}\Big)\,(A\,\bar\otimes\, I_\nu)\,\Big(I_{n/\nu} \otimes \underbrace{L^{\nu^2}_{\nu}}_{\text{base}}\Big)\big(L^{n}_{n/\nu}\,\bar\otimes\, I_\nu\big) \Big).$$

This is the final vectorized formula.

4.1.3 Vectorization by Rewriting $\sum$-SPL Formulas

To make this approach compatible with library generation, we convert the tagged SPL formulas to tagged $\sum$-SPL formulas and convert the vectorization/parallelization rules from Table 4.2 to GT breakdown rules.

The index-free $\sum$-SPL and GT representation exposes the data access pattern of loops. Most importantly, at this level of representation we can easily recognize tensor products by looking at the index mapping functions. Moreover, since constants (such as the vector stride) survive parametrization, tensor products can still be recognized after parametrization. Thus, the original SPL vectorization of Spiral is guaranteed to be portable to the parametrized index-free $\sum$-SPL with GTs.


$$G(f^{n\to N}) = \sum_{j} S(h_{j,\,1})\, I_1\, G(f \circ h_{j,\,1}) = \operatorname{GT}(I_1,\ f \circ h_{0,\,1,1},\ h_{0,\,1,1},\ \{n\}),$$

$$\operatorname{perm}(f) = G(f), \qquad S(f) = G(f)^{\top} = \operatorname{GT}(I_1,\ h_{0,\,1,1},\ f \circ h_{0,\,1,1},\ \{n\}),$$

$$\operatorname{diag}(d^{n\to\mathbb{C}}) = \sum_{j} S(h_{j,\,1})\,\operatorname{diag}(d \circ h_{j,\,1})\, G(h_{j,\,1}) = \operatorname{GT}(\operatorname{diag}(d \circ h_{0,\,1,1}),\ h_{0,\,1,1},\ h_{0,\,1,1},\ \{n\}).$$

Table 4.3: Converting other constructs to GT.


Since GT can describe a larger formula space using only a few orthogonal constructs, we can vectorize a larger variety of formulas. From another perspective, since GT represents a loop, standard loop vectorization techniques for short vectors [25, 75, 85] apply, and we can recast our vectorization method using standard compiler terminology.

The GT-based vectorization for libraries has two separate phases:

• finding vectorizable loops (and marking them for vectorization);

• vectorizing the marked loops.

This is similar to a general purpose vectorizing compiler [25, 85]; however, there are important differences. Unlike in a general purpose compiler, our vectorization approach is a global transformation, which can vectorize across multiple recursive function calls. As a consequence, we are not limited to vectorizing inner loops; in fact, most of the time we vectorize outermost loops. Finally, we are not limited to unit-stride loops.

Finding vectorizable loops. In a general purpose compiler, a vectorizable loop is an inner loop without loop-carried dependencies that satisfies some additional constraints, e.g., all computations are of the same data type, all data accesses are unit-stride, and there are no function calls.

Interestingly, in $\sum$-SPL loop-carried dependencies do not exist, and we impose no other constraints on vectorizable loops. Thus, surprisingly, all loops are vectorizable. Clearly, the best choices are the loops with the largest bodies, i.e., the outer loops.

To find maximal vectorizable loops in a $\sum$-SPL formula $F$, we start by tagging it as $\operatorname{Vec}_\nu(F)$, and then propagate the tag down using only the pair of rewrite rules below:

$$\operatorname{Vec}_\nu(A \cdot B) \to \operatorname{Vec}_\nu(A) \cdot \operatorname{Vec}_\nu(B),$$

$$\operatorname{Vec}_\nu(A + B) \to \operatorname{Vec}_\nu(A) + \operatorname{Vec}_\nu(B).$$

Quickly, the rewriting will yield tagged GTs and, in a few cases, other standalone constructs, namely G, S, perm, diag, and unstructured sparse matrices. We do not handle vectorization of unstructured sparse matrices; these only occur in formulas for small transforms, obtained using less structured algorithms, which are not practical for large transforms. G, S, perm, and diag can be trivially converted into iterative sums, and thus into GT, using the rules in Table 4.3.


Vectorizing marked loops. In the previous step we found maximal vectorizable loops (GTs). If other constructs happen to be tagged, they are also converted to GT. Thus, the remaining step is to vectorize these GTs. The standard loop vectorization procedure consists of three substeps:

1. strip-mining;

2. maximal loop distribution (and loop interchange in our case);

3. replacement by vector operations.

We perform the first step using a new GT breakdown rule, GT-Vec, a specialized variant of strip-mining (3.50). The second step is performed by the loop transformation infrastructure described in Section 3.7, namely by the rules GT-DownRank and GT-Distr ((3.48) and (3.49)). The final step is performed by the Spiral backend, which maps $\sum$-SPL to code.

GT-Vec: strip-mining for vectorization. Strip-mining splits a single loop into two nested loops. For vectorization, we make the inner loop consist of ν iterations (the vector length) and tag the inner loop with the so-called sticky vector tag:

$$\operatorname{Vec}_\nu(\operatorname{GT}(A, g, s, \{n\})) \xrightarrow{\text{GT-Vec}} \operatorname{GT}\big(\operatorname{split}(A, 1, \nu),\ \operatorname{split}(g, 1, \nu),\ \operatorname{split}(s, 1, \nu),\ \{\nu, n/\nu\}\big),$$

where the inner ν-iteration loop carries the sticky vector tag.

We call the tagged inner loop on the right-hand side above a sticky vector loop. The goal of the rewrite system is to maximally distribute the sticky loop. The goal of the backend is to implement sticky vector loops using vector instructions. However, since these loops are maximally distributed, the backend only has to handle a few orthogonal $\sum$-SPL constructs.

In general purpose vectorizing compilers [25, 85] the scalar operations are replaced by vector equivalents immediately after strip-mining, which precludes vectorization from being applied to outer loops. Our method is one possible solution to this problem. The analysis required by a general purpose compiler to vectorize an outer loop in such a scenario might be prohibitively expensive. The nature of $\sum$-SPL, however, guarantees the legality of loop distribution and interchange in all cases.

Maximal loop distribution and loop interchange. In this step we maximally distribute the sticky loop into multiple loops with a single "statement" as the body of each loop. The meaning of "single statement" is context dependent; here we mean a single $\sum$-SPL construct.

To accomplish the maximal distribution, we unconditionally apply the loop distribution rule GT-Distr to the sticky loop, and disallow downranking of the tagged sticky loop. Proceeding this way, iterating downranking and distribution will push the sticky loop all the way down. For example:

$$\operatorname{GT}(\operatorname{GT} \cdot \operatorname{GT}) \to \operatorname{GT}(\operatorname{GT}) \cdot \operatorname{GT}(\operatorname{GT}) \to \operatorname{GT}^2 \cdot \operatorname{GT}^2$$

Above we denoted rank-2 GTs by $\operatorname{GT}^2$. The rank-2 GTs can be downranked to rank-1 GTs with sticky vector loops, and the process continues as we recurse deeper and deeper.

Replacement by vector operations. Eventually the recursion has to be terminated with a base case, and because the sticky vector loop cannot be downranked, the base cases will necessarily be of the form

$$\operatorname{GT}(A, g, s, \{\nu\}).$$

This plays the role of $A \otimes I_\nu$ in the SPL vectorization, and means that $A$ must be implemented using vector arithmetic instead of regular scalar operations. However, the important difference in the $\sum$-SPL case is that the data required for the different vector slots might not be contiguous in memory.


[Figure: descent tree. Vec2(DFT32) expands via DFT-CTA (f = 4) into two Vec2-tagged GTs; GT-Vec strip-mines each into a rank-2 GT with a sticky 2-iteration vector loop; GT-DownRank (f = 2) and a further DFT-CTA (f = 2) step lead to rank-1 GT base cases with sticky vector loops.]

Figure 4.2: Descent tree for Vec2(DFT32).

memory. If this is the case, then subvector loads and stores can be used to populate the vectorselement-by-element. Although this is generally slower, this enables the

∑-SPL vectorization to

work for a wider class of formulas than the SPL vectorization.

The gather and scatter functions g and s must be analyzed to determine whether subvector access is required. The simplest case occurs when both g and s are of the form h_{0, s, 1} with ν | s, which means that the vectors are contiguous in memory and can be loaded and stored using standard vector load and store instructions.
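A minimal sketch of this analysis (hypothetical struct and names; the generator's internal representation differs): the ν slots of a vector are consecutive in memory exactly when the inner stride is 1, the base is 0, and the chunks stay vector-aligned.

/* Sketch only: decide whether a gather/scatter permits full vector loads. */
typedef struct {
    int base;         /* b in h_{b, s_outer, s_inner} */
    int stride_outer; /* stride between consecutive loop iterations */
    int stride_inner; /* stride between the nu vector slots */
} IndexFunc;

int contiguous_vector_access(IndexFunc h, int nu) {
    /* h_{0, s, 1} with nu | s: full vector loads/stores are possible;
       anything else requires element-wise subvector access. */
    return h.base == 0 && h.stride_inner == 1 && h.stride_outer % nu == 0;
}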

4.1.4 Vectorized Closure Example

In Fig. 4.2 we show a descent tree for a vectorization of DFT_32 implemented using (3.1). The descent tree should be compared to the non-vectorized descent tree in Fig. 3.5.

In Table 4.4 we show the recursion step closure and the corresponding call graph for the 2-way vectorized DCT-4. The closure has 16 recursion steps, many of which are rank-1 or rank-2 GTs with sticky vector loops of 2 iterations. These loops are translated to vector instructions in the generated code.


[Table content: recursion steps RS 1–RS 17 with their parametrized ∑-SPL formulas, and the call graph connecting them.]

Table 4.4: Recursion step closure and call graph for the 2-way vectorized DCT-4.


4.2 Parallelization

4.2.1 Background

The need for threads. After years of exponential growth, the CPU frequencies of recent generations of microprocessors have practically stalled, a consequence of the physical limits imposed by their power density. To keep Moore's Law on track, chip makers have started to follow a different route: multicore processors, also called chip multiprocessors (CMPs), that integrate multiple processor cores onto one chip. Dual- and quad-core processors are currently sold by Intel, IBM, and AMD. IBM's Cell processor has eight special-purpose cores on one chip. The latest generation of multicore processors already targets consumer desktop and laptop computers, making multicore a mainstream technology.

With this in mind, threading support becomes a de facto requirement of any high-performance library. Programmers in charge of developing high-performance libraries are already confronted with the difficult task of optimizing for deep memory hierarchies and extracting fine-grain parallelism for vector instruction sets. Now, this challenge is compounded with extracting coarse-grain parallelism and multithreaded programming.

Most multicore processors are built as parallel shared memory architectures, which means that several processor cores share a single main memory but may have private caches. Shared memory architectures impose the shared memory programming model: no explicit communication is required in software, and the threads communicate by accessing the same memory locations. In this thesis we only consider shared memory architectures.

Compilers. Today's parallelizing compilers grew out of a vast body of research in the late 1980s and early 1990s [12, 66, 134, 139]. The resulting compilers are quite successful and provide good performance scaling for relatively simple programs. However, any highly optimized library supporting vector instructions and exploiting the memory hierarchy is far from simple. Unfortunately, parallelizing a large multifunction program into coarse-grain threads is still beyond the reach of any parallelizing compiler.

Nevertheless, compilers can greatly help with parallel programming by providing a simple and portable way to create and control threads. In this thesis we used the OpenMP language extension [35], which is a great example of this. OpenMP extends C/C++ (or Fortran) by directives inlined into the source code as preprocessor pragmas (C/C++) or comments (Fortran) to pass parallelization information to the compiler, and also includes a supporting runtime library.
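As a minimal generic illustration of the directive style (not generated code), a single pragma distributes the iterations of a loop across threads; if the compiler does not support OpenMP, the pragma is ignored and the loop simply runs sequentially:

/* Generic OpenMP example: the pragma parallelizes the loop. */
void scale(double *y, const double *x, int n, double a) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}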

Our solution. Building on our previous work, we extend the library generator to automatically generate threaded code. We follow the same general approach as with vectorization, starting with the formula tags and tagged SPL constructs, and finally obtaining parallel code.

As with vectorization, we first show how to formally derive threaded transform algorithms suitable for shared memory parallelism by rewriting the SPL formulas. Next we extend this approach to ∑-SPL, which makes it more generally applicable and covers a larger set of transforms.

The first approach, operating at the SPL level, is applicable only to a subset of transforms and algorithms, but enables us to reason about desirable properties; in particular, we can prove that the algorithms offer perfect load balancing and avoid false sharing. The second approach is more generally applicable, but makes the analysis more difficult.

We will sketch the SPL parallelization as a point of reference, but mostly focus on ∑-SPL parallelization, which is used for generating libraries.

Challenges of programming for shared memory multiprocessors. Writing fast parallel programs is considerably more challenging than writing fast sequential programs. For problems that are data parallel (but not embarrassingly parallel) the programmer (or the library generator) must address the following issues:

1) Load balancing: All processors should have an equal amount of work assigned. In particular, sequential parts should be avoided since they limit the achievable speed-up due to Amdahl's law.

2) Synchronization overhead: Synchronization should involve as little overhead as possible, and threads should not wait unnecessarily at synchronization points.

3) Avoiding false sharing: Private data of different processors should not be in the same cache line, since this leads to cache thrashing and thus severe performance degradation.

Usually, in addition to the above, special care must be taken to avoid excessive locking of shared variables, deadlocks, and race conditions. However, Spiral-generated libraries do not require locks, and are free of deadlocks and race conditions by construction, because only loops with fully independent iterations are parallelized.

Parallel transform algorithms. Most of the previous work on parallel transform algorithms concentrates on parallel DFTs. [70] shows how to design parallel DFT algorithms for various architecture constraints using the Kronecker product formalism and is a major influence on our work. [122] gives a good overview of sequential and parallel DFT algorithms. The major problem with using the standard Cooley-Tukey FFT algorithm (2.1) on shared memory machines is its memory access pattern: the strides are large, and consecutive loop iterations touch the same cache lines, which leads to false sharing.

The governing idea of many parallel algorithms [10, 87, 106] is to reorder the data in explicit steps to remove the false sharing introduced by strided memory access. For example, the six-step algorithm,

DFT_{mn} → L^{mn}_m (I_n ⊗ DFT_m) L^{mn}_n D_{m,n} (I_m ⊗ DFT_n) L^{mn}_m,   (4.12)

has embarrassingly parallel computation stages of the form I_r ⊗ DFT_s. The three stride permutations in (4.12) are executed separately as explicit matrix transpositions, i.e., data permutations. These transpositions are further optimized, e.g., through blocking [3], and partially folded into the adjacent computation stages [117, 118]. A different approach reduces communication at the cost of more computation, by using O(n²) algorithms instead of fast O(n log n) algorithms to remove dependencies on small subproblems [4].

FFTW 3 [61] offers a state-of-the-art multithreaded implementation of the DFT and related transforms. FFTW parallelizes one- and multi-dimensional DFTs by allowing its search mechanism to parallelize the loops that occur inside the algorithms. It uses advanced loop optimizations to avoid cache problems (loop distribution and loop interchange) and supports thread pooling to minimize the startup cost of parallel computation. The hand-written implementation of threading in FFTW is roughly equivalent to our parallelization at the ∑-SPL level, which also works by parallelizing loops.

4.2.2 Parallelization by Rewriting SPL Formulas

In this section we explain our first approach, which parallelizes SPL formulas. The parallelization of SPL formulas is performed similarly to vectorization. We notice that SPL constructs have a direct interpretation in terms of parallel code. For example, the tensor products in (2.1) are essentially loops with fully independent iterations (no loop-carried dependencies) and known memory access patterns. Next, we automatically rewrite a Spiral-generated formula to obtain a structure suitable for mapping into efficient multithreaded code.

A formula fully determines the memory access pattern of the final program (as a function of the loop variables), and thus using rewriting we can statically schedule the loop iterations across p processors to ensure load balancing and eliminate false sharing. For general programs, proving the independence of loop iterations and determining such a schedule is a hard problem requiring expensive analysis [12]. The implied loops in SPL formulas, on the other hand, have by definition independent iterations and known memory access patterns. Thus, as we show, we can find such a schedule efficiently.

We first explain the parallelizing rewriting system, and then show the application to the DFT, effectively deriving a novel variant of the Cooley-Tukey FFT (2.1) different from (4.12) and well-suited for multicore systems.

The extension of Spiral to support shared memory parallelism requires four components:

• Parallelism tags, which introduce the relevant hardware parameters into the rewriting system.

• Parallel formula constructs, which denote the subformulas that can be perfectly mapped to shared memory parallel platforms.

• Rewriting rules, which transform general formulas into parallel formulas.

• A parallel backend that maps parallel formulas into parallel C or C++ code (in our case using OpenMP extensions).

Parallelism tags. The two most relevant parameters of shared memory parallel machines are the number of processors p and the cache line length µ of the most important cache level. We measure µ in elements of the input signal. For instance, for a cache line length of 64 bytes (= 512 bits) and a complex double precision data type (= 2 × 64 = 128 bits), µ = 4. These parameters are introduced into the rewriting system with a parallel tag, for example,

Par_{p,µ}(A).

In addition, we assume that all shared data vectors are aligned at cache line boundaries in the final program.
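The arithmetic for µ is simple; a small sketch (hypothetical helper, values from the example above):

/* Sketch: cache line length measured in elements of the input signal.
   E.g., 64-byte lines and complex double elements (16 bytes) give mu = 4.
   Assumes the element size divides the line size. */
int cache_line_in_elements(int line_bytes, int element_bytes) {
    return line_bytes / element_bytes;
}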

Parallel formula constructs. For arbitrary A and A_i the expressions

y = (I_p ⊗ A)x   and   y = (⊕_{i=0}^{p−1} A_i)x,   with A, A_i ∈ C^{mµ×mµ},

express embarrassingly parallel computation on p processors, as they represent block-diagonal matrices with p blocks. Requiring the matrix dimensions to be multiples of µ ensures that during computation each cache line is owned by exactly one processor, preventing false sharing. If all A_i have the same computational cost, programs implementing these constructs become load balanced. Data shuffling of the form

y = (P ⊗ I_µ)x,   with P a permutation,


Par_{p,µ}(AB) → Par_{p,µ}(A) Par_{p,µ}(B)   (4.15)

Par_{p,µ}(A_m ⊗ I_n) → Par_{p,µ}((L^{mp}_m ⊗ I_{n/p})(I_p ⊗ (A_m ⊗ I_{n/p}))(L^{mp}_p ⊗ I_{n/p}))   (4.16)

Par_{p,µ}(L^{mn}_m) → Par_{p,µ}(I_p ⊗ L^{mn/p}_{m/p}) Par_{p,µ}(L^{pn}_p ⊗ I_{m/p})
                    | Par_{p,µ}(L^{pm}_m ⊗ I_{n/p}) Par_{p,µ}(I_p ⊗ L^{mn/p}_m)   (4.17)

Par_{p,µ}(I_m ⊗ A_n) → I_p ⊗_‖ (I_{m/p} ⊗ A_n)   (4.18)

Par_{p,µ}(P ⊗ I_n) → (P ⊗ I_{n/µ}) ⊗̄ I_µ   (4.19)

Table 4.5: SPL shared memory parallelization rules. P is any permutation.

reorders blocks of µ consecutive elements, and thus whole cache lines are reordered. On shared memory machines this means that such reordering passes make the threads exchange ownership of entire cache lines, avoiding false sharing. Recall also that at the ∑-SPL level these permutations are combined with adjacent computation blocks. We introduce tagged versions of the tensor product and direct sum operators in Spiral:

I_p ⊗_‖ A,   ⊕_{‖,i=0}^{p−1} A_i,   P ⊗̄ I_µ,   with A, A_i ∈ C^{mµ×mµ}.   (4.13)

These are the same matrix operators as their untagged counterparts, but declare that a construct is fully optimized for shared memory machines and does not require further rewriting. By fully optimized we mean that the formula is load-balanced for p processors (provided the A_i have equal computational cost) and avoids false sharing. This property is preserved for products of these constructs.

Definition 2 We say that a formula is load-balanced (avoids false sharing) if it is of the form (4.13) or of the form

I_m ⊗ A   or   AB,   (4.14)

where A and B are load-balanced formulas (formulas that avoid false sharing). A formula is fully optimized (for shared memory) if it is load-balanced and avoids false sharing.

The goal of the rewriting system (explained next) is to transform formulas into fully optimized formulas.

Rewriting rules. Table 4.5 summarizes the rewriting rules sufficient for parallelizing a subset of SPL formulas, including the Cooley-Tukey FFT (2.1). These rules form the core of the parallelization. All parameters in the rules are integers, and thus an expression n/p on the right-hand side of a rule implies that the precondition p|n must hold for the rule to be applicable.

As an example, consider applying rule (4.16), which encodes a form of loop strip-mining and scheduling:

F_2 ⊗ I_n --apply (4.16)--> (L^{2p}_2 ⊗ I_{n/p})(I_p ⊗ (F_2 ⊗ I_{n/p}))(L^{2p}_p ⊗ I_{n/p}).   (4.20)

The left-hand side represents a loop with unit stride between iterations. On the right-hand side, the construct I_p ⊗ (F_2 ⊗ I_{n/p}) corresponds to a double loop with p iterations in the outermost and n/p in the innermost loop; the permutations L^{2p}_2 ⊗ I_{n/p} and L^{2p}_p ⊗ I_{n/p} represent data reindexing, which is merged with the tensor product by Spiral's loop merging stage (Section 2.4).

To produce the final code, Spiral further applies rules (4.15) and (4.17)–(4.19) and performs loop merging. The resulting pseudo-code is shown below.

for (int tid = 0; tid < p; ++tid) { /* parallel loop, thread id = tid */
    for (int j = 0; j < n/p; ++j) {
        y[tid*n/p + j]     = x[tid*n/p + j] + x[tid*n/p + j + n];
        y[tid*n/p + j + n] = x[tid*n/p + j] - x[tid*n/p + j + n];
    }
}

Above, the p iterations of the outer loop are distributed among p threads. Each processor executes n/p consecutive iterations of the original loop given by the left-hand side of (4.20) and touches 2 contiguous memory areas of n/p complex numbers. If µ|(n/p), then each processor "owns" 2n/(pµ) cache lines.

Rule (4.15) expresses that in products of matrices each factor is rewritten separately. Rules (4.16) and (4.18) handle tensor products with identity matrices. Both rules distribute the computational load evenly among the p processors and execute as many consecutive iterations as possible on the same processor (as shown above). Rule (4.17) breaks stride permutations into two stages: one performs stride permutations locally on each processor, the other permutes consecutive chunks of data. Rules (4.16) and (4.17) require the subsequent application of (4.15), (4.18), and (4.19) to fully break down to the parallel formula constructs (4.13). Tensor products of a permutation and a sufficiently large identity matrix are broken down to cache line resolution by (4.19).

The rules in Table 4.5 are based on known formula identities summarized in [51, 53, 70]. They replace the usually expensive analysis required for the associated loop transformations by cheap pattern matching, and also encode the actual transformation.

Parallel backend. Extending Spiral's implementation level to support shared memory parallel code is straightforward. The only thing we have to add is the translation of the constructs

I_p ⊗_‖ A   and   ⊕_{‖,i=0}^{p−1} A_i

into parallel code for p threads. There is no need for special support for P ⊗̄ I_µ, as these permutations are simply merged with the adjacent loops by the ∑-SPL optimizations.

We use OpenMP to generate parallel C code. The variables defined outside of the parallel constructs must be shared between the threads, and variables defined inside must be thread-local. Since this is the default behavior of OpenMP, no special flags are needed. For example, the loop corresponding to y = (I_p ⊗_‖ F_2)x is implemented in OpenMP as:

/* one iteration per thread */
#pragma omp parallel num_threads(p)
{
    int tid = omp_get_thread_num();
    /* apply F_2 to subvectors of x */
    y[2*tid]     = x[2*tid] + x[2*tid + 1];
    y[2*tid + 1] = x[2*tid] - x[2*tid + 1];
    /* wait until all threads finish work */
    #pragma omp barrier
}

Note that we do not use the "parallel for" construct provided by OpenMP, because the implementation using barriers turns out to be faster and more flexible.
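For contrast, a generic sketch (not generated code) of the same computation using "parallel for", which leaves the scheduling to OpenMP and ends with an implicit barrier:

/* Generic sketch: OpenMP picks the schedule and inserts an implicit
   barrier at the end of the loop. */
#pragma omp parallel for num_threads(p)
for (int tid = 0; tid < p; tid++) {
    y[2*tid]     = x[2*tid] + x[2*tid + 1];
    y[2*tid + 1] = x[2*tid] - x[2*tid + 1];
}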

Example: parallelizing the Cooley-Tukey FFT. Now we apply the rewrite rules from Table 4.5 to derive a parallel version of the Cooley-Tukey FFT (2.1). In Spiral these steps are performed automatically. The result is a version of the Cooley-Tukey FFT that is fully optimized for shared memory in the sense of Definition 2.

We start by specifying that we want to compute y = DFT_N x on p processors with cache line size µ,

Par_{p,µ}(DFT_N).

In the first step, rule (2.1) chooses a factorization of N = mn and breaks DFT_N into DFT_m and DFT_n. Next, rule (4.15) propagates the parallelization tag to all factors:

Par_{p,µ}(DFT_{mn}) → Par_{p,µ}((DFT_m ⊗ I_n) D_{m,n} (I_m ⊗ DFT_n) L^{mn}_m)   (4.21)
→ Par_{p,µ}(DFT_m ⊗ I_n) Par_{p,µ}(D_{m,n}) Par_{p,µ}(I_m ⊗ DFT_n) Par_{p,µ}(L^{mn}_m).   (4.22)

We now consider each of the four factors on the right-hand side of (4.22) separately. The first factor is rewritten by applying (4.16) (requiring p|n),

Par_{p,µ}(DFT_m ⊗ I_n) → Par_{p,µ}(L^{mp}_m ⊗ I_{n/p}) Par_{p,µ}(I_p ⊗ (DFT_m ⊗ I_{n/p})) Par_{p,µ}(L^{mp}_p ⊗ I_{n/p}),

followed by (4.15), (4.18), and (4.19) (requiring µ|(n/p)),

Par_{p,µ}(DFT_m ⊗ I_n) → ((L^{mp}_m ⊗ I_{n/pµ}) ⊗̄ I_µ)(I_p ⊗_‖ (DFT_m ⊗ I_{n/p}))((L^{mp}_p ⊗ I_{n/pµ}) ⊗̄ I_µ).   (4.23)

The second factor in (4.22) is a diagonal which will be merged into the tensor product, thus the tag is simply dropped, and the third factor is handled using (4.18) (requiring p|m),

Par_{p,µ}(I_m ⊗ DFT_n) → I_p ⊗_‖ (I_{m/p} ⊗ DFT_n).   (4.24)

The remaining fourth factor in (4.22) is parallelized by the first choice of rule (4.17) (requiring p|m), followed by (4.15), (4.18), and (4.19) (requiring µ|(m/p)),

Par_{p,µ}(L^{mn}_m) → Par_{p,µ}(I_p ⊗ L^{mn/p}_{m/p}) Par_{p,µ}(L^{pn}_p ⊗ I_{m/p})   (4.25)
→ (I_p ⊗_‖ L^{mn/p}_{m/p})((L^{pn}_p ⊗ I_{m/pµ}) ⊗̄ I_µ).   (4.26)

Collecting (4.23)–(4.26) and the constraints required for applying the rules leads to the final expression output by our rewriting system, (4.27) displayed in Figure 4.3, with the requirements pµ|m and pµ|n. Inspection shows that (4.27) is fully optimized for shared memory in the sense of Definition 2.

As a small example, Figure 4.4 shows the C99 OpenMP program generated by Spiral from (4.27) with m = 4, n = 2, p = 2, and µ = 2, after loop merging and fully unrolling the code for DFT_2 and DFT_4.

Discussion. Parallelization on the SPL level only works for a subset of SPL formulas.


Par_{p,µ}(DFT_{mn}) → ((L^{mp}_m ⊗ I_{n/pµ}) ⊗̄ I_µ)(I_p ⊗_‖ (DFT_m ⊗ I_{n/p}))((L^{mp}_p ⊗ I_{n/pµ}) ⊗̄ I_µ)
    D_{m,n} (I_p ⊗_‖ (I_{m/p} ⊗ DFT_n))(I_p ⊗_‖ L^{mn/p}_{m/p})((L^{pn}_p ⊗ I_{m/pµ}) ⊗̄ I_µ)   (4.27)

Figure 4.3: Multicore Cooley-Tukey FFT for p processors and cache line length µ.

// C99 OpenMP DFT_8, call by a sequential function
#include <omp.h>

static _Complex double D[8] = {
    1, 1, 1, 0.70710678118654 + __I__*0.70710678118654,
    1, 1, 1, -0.70710678118654 + __I__*0.70710678118654
};

void DFT_8(_Complex double *Y, _Complex double *X) {
    static _Complex double T[8];
    #pragma omp parallel for
    for (int i1 = 0; i1 <= 1; i1++) {
        _Complex double s1, s2;
        for (int i2 = 0; i2 <= 1; i2++) {
            s1 = X[2*i1 + i2];
            s2 = X[4 + 2*i1 + i2];
            T[4*i1 + 2*i2]     = s1 + s2;
            T[4*i1 + 2*i2 + 1] = s1 - s2;
        }
    }
    #pragma omp parallel for
    for (int i3 = 0; i3 <= 1; i3++) {
        _Complex double s3, s4, s5, s6, s7, s8, s9, s10;
        s10 = D[i3]*T[i3];
        s9  = D[4 + i3]*T[4 + i3];
        s8  = s10 + s9;
        s4  = D[6 + i3]*T[6 + i3];
        s7  = D[2 + i3]*T[2 + i3];
        s6  = s7 + s4;
        s5  = s10 - s9;
        s3  = __I__*(s7 - s4);
        Y[i3]     = s8 + s6;
        Y[4 + i3] = s8 - s6;
        Y[2 + i3] = s5 + s3;
        Y[6 + i3] = s5 - s3;
    }
}

Figure 4.4: Multithreaded C99 OpenMP function computing y = DFT_8 x using 2 processors, called by a sequential program.


In particular, it is possible to parallelize the Cooley-Tukey FFT, multi-dimensional transforms, the WHT, and the Haar wavelet (which uses 2-tap filters).

4.2.3 Parallelization by Rewriting ∑-SPL Formulas

To port the SPL parallelization to ∑-SPL we need to convert the tensor product parallelization breakdown rules to GT breakdown rules. Since GT can describe a larger formula space, the GT breakdown rules need to be more general.

Similarly to tensor products, the GT loops are, in parallel Fortran terminology, "DOALL" loops, which means that they have independent iterations and are trivially parallelizable. So the more important questions are 1) which loops to parallelize and 2) how to avoid false sharing and ensure load balancing.

Currently, our strategy is to parallelize the top-level loops only. This is not necessarily always the best strategy. Sometimes it may make sense to parallelize deeper in the loop nest to effectively exploit the shared caches. This can be rather easily accommodated by adding additional breakdown rules, and having the library runtime adaptation mechanism select the best strategy.

False sharing is avoided in the same way as when parallelizing SPL formulas, namely by statically scheduling the threads in a special way. In some cases this does not avoid false sharing entirely, but it provides a good heuristic.

We now discuss how this works in more detail. In order to parallelize by rewriting ∑-SPL formulas, we need the following components, which we explain next.

1. Parallelization tags (these are the same as before).

2. Parallel ∑-SPL formula constructs, which denote formulas with a parallel interpretation; in this case we also need a few more low-level constructs, such as a barrier.

3. Rewrite rules that manipulate general ∑-SPL formulas into parallel ones.

4. A parallel backend that emits parallel OpenMP code from parallel ∑-SPL constructs (it is a straightforward extension of the backend explained in Section 4.2.2, and we do not discuss it here).

Parallelization tags. We use the same parallelization tag as before to denote parallelization:

Par_{p,µ}(A).

Parallel ∑-SPL constructs. The following parallel ∑-SPL constructs are needed:

• Parallel sum: expresses a loop whose iterations must be executed in parallel:

∑_{‖,i=0}^{p−1} A_i

• Barrier: threads wait until all peers finish computing A:

barrier(A)


Rewrite rules:

Par_{p,µ}(AB) → barrier(A) barrier(B),   (4.28)
Par_{p,µ}(A + T) → tid_{p−1}(A) + Par_{p,µ}(T).   (4.29)

Breakdown rules:

Par_{p,µ}(T) --GT-Par, p|n--> ∑_{‖,j=0}^{p−1} d(T, j),   (4.30)

Par_{p,µ}(T) --GT-ParND, p∤n--> ∑_{‖,j=0}^{p−2} d(T, j) + tid_{p−1}(GT(d(A, p−1), d(f, p−1), d(g, p−1), n mod p)),   (4.31)

Par_{p,µ}(S) --GTI-Par, p|n--> ∑_{‖,j=0}^{p−1} d(S, j),   (4.32)

Par_{p,µ}(S) --GTI-ParND, p∤n--> ∑_{‖,j=0}^{p−2} d(S, j) + tid_{p−1}(GTI(d(A, p−1), d(f, p−1), n mod p)),   (4.33)

where d(a, x) = down(split(a, 1, ⌈n/p⌉), x, 2).   (4.34)

Table 4.6: ∑-SPL shared memory parallelization rules. T = GT(A, f, g, n), S = GTI(A, f, n).

• Thread ID assignment: A must execute on thread i only:

tid_i(A)

Rewrite and breakdown rules. Parallelization is achieved via a mixture of breakdown and rewrite rules. As earlier, breakdown rules are used when alternatives exist, and provide the possibility to search over the alternatives.

Table 4.6 shows the rewrite and breakdown rules that parallelize at the ∑-SPL level with GT nonterminals. We also provide parallelization for the case when the number of processors p does not divide the number of loop iterations n.

The rules that we show assume that the parallelized GTs are of rank 1. In Spiral we implemented a general version of these rules that always parallelizes the outermost loop.

The rationale of all GT loop parallelization rules is to assign to each thread an equal-size chunk of consecutive iterations. This coincides with the strategy that OpenMP uses to parallelize "parallel for" loops, and also with the guaranteed false-sharing-free rules of the SPL parallelization. The iteration space partition is shown in Fig. 4.5. When parallelizing GT loops, the false-sharing-free

Page 111: Library Generation For Linear Transforms

94 CHAPTER 4. LIBRARY GENERATION: PARALLELISM

[Figure: N iterations are split into p contiguous chunks of N/p iterations each, assigned to threads 0, ..., p−1 over the index range i = 0 to i = N−1.]

Figure 4.5: GT iteration space partition implied by the GT parallelization rules (4.30)–(4.33).

guarantee only holds for specific index mapping functions; for other index mapping functions it is just a heuristic, which tends to work very well.
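A small sketch of the iteration space split that rules (4.30)–(4.33) express (hypothetical helper; the generated code inlines this arithmetic): each thread receives ⌈n/p⌉ consecutive iterations, and when p does not divide n the last thread receives the shorter remainder chunk:

/* Sketch: contiguous iteration chunk [begin, end) for thread j of p,
   mirroring the partition of Fig. 4.5. For the loop sizes of interest
   the last chunk is non-empty. */
void thread_chunk(int n, int p, int j, int *begin, int *end) {
    int chunk = (n + p - 1) / p;          /* ceil(n/p) */
    *begin = j * chunk;
    *end   = (j == p - 1) ? n : *begin + chunk;
}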

To find the loops to parallelize, the first pair of rewrite rules is used. Note that there is no general rule for Par_{p,µ}(A + B), but only for the case when B is a GT nonterminal (i.e., rule (4.29)). The reason is twofold. First, we did not need such a rule for the transform breakdown rules that were used to generate libraries (Table 6.1), and the special variant (4.29) was sufficient. Second, it is no longer clear how to statically schedule more than 2 threads with a summation like A + B. The best strategy would most likely be to dispatch a portion of the threads to compute A and another portion to compute B, and to parallelize within A and B. The only way to find the best schedule would be to use multiple breakdown rules (instead of rewrite rules) to enable search at runtime.

4.2.4 Parallelized Closure Example

Table 4.7 shows the closure generated for Par_{p,µ}(DFT) implemented via the Cooley-Tukey breakdown rule (DFT-CT). The following breakdown rules were used to construct the closure: DFT-CT (2.1), GT-DownRank (3.47), GT-Par (4.30), and GT-ParND (4.31). The non-parallelized equivalent is the closure in Table 3.11 and Fig. 3.8(a).

The shown parallel closure has 8 recursion steps, while the non-parallelized closure in Table 3.11 has 7.


[Table content: recursion steps RS 1–RS 8 with their parametrized ∑-SPL formulas, and the call graph connecting them.]

Table 4.7: Recursion step closure and call graph for the 2-thread parallelized DFT. Compare to the DFT closure in Table 3.11 and the call graph in Fig. 3.8(a).


Chapter 5

Library Generation: Library Implementation

In Chapter 3 we described the Library Structure module of the library generation framework (see Fig. 3.1), which is responsible for generating the recursion step closure and performing large-scale structural optimizations for parallelization and vectorization. The generated recursion step closure defines the final structure of the library in an abstract way. It can also be viewed as a program for a hypothetical ∑-SPL virtual machine, which can interpret formulas.

The goal of the Library Implementation module is to translate or compile the program for the hypothetical ∑-SPL virtual machine into a program in some lower-level target language. However, given the declarative and very high-level nature of ∑-SPL, there are many different ways in which a recursion step closure can be compiled into a library.

In particular, there are different kinds of libraries, for example, adaptive and fixed, floating-point and fixed-point, etc. Our goal is to be able to generate all of these kinds of libraries. We call the particular set of requirements for the generated library, together with the target language, the library target.

As part of this thesis we implemented both adaptive and fixed C++ floating-point libraries. Other potentially interesting library targets are extensions of existing libraries with generated code, fixed-point arithmetic based libraries, and libraries in C or Java.

It is also useful to choose an existing library, such as Intel IPP, as the library target. In this case the generated code is meant to extend the existing library with new functionality. Generating new code for an existing library requires the generated code to follow certain established coding and naming conventions, and to satisfy other requirements. These requirements must be incorporated into the library target specification.

Because the Library Implementation module does not require any transform-specific code, once support is added for a new library target, one can immediately generate a high-performance library for that target. At the same time, the formula-level transformations can proceed independently of the chosen library target.

5.1 Overview

At the high level, the task of this module is to combine the transform recursion and base cases within the library infrastructure of the chosen library target. This boils down to generating the


"glue code", which must address the following problems:

• Degree of freedom in choosing the breakdown rules. If more than one breakdown rule is applicable to a recursion step, there needs to be a dispatch mechanism that selects which method is invoked at runtime.

• Integration of base cases. In order for the recursion to eventually terminate, the code for the base cases needs to be integrated into the dispatch mechanism.

• Degrees of freedom within the breakdown rule. The dispatch mechanism must be able to choose appropriate values for the degrees of freedom, such as the value of k|n in (3.1).

• Initialization code. Each transform requires some initialization code. The tasks of this code include memory allocation for temporary buffers and scaling factors, precomputation of the scaling factors, and so forth.

We would like to enable program generation for different library targets. Each library target handles the above issues differently. Possible library targets include:

1. A standalone non-adaptive library.

2. A standalone adaptive library.

3. An extension of an existing non-adaptive (e.g., Intel MKL) or adaptive library (e.g., FFTW).

In the case of standalone libraries, some infrastructure must be provided (i.e., memory management, error handling, and, for adaptive libraries, search capabilities) to be used along with the generated transform and glue code.

Discussion. If we compare our implementation goal above with existing FFT libraries, we get a hint towards the solution.

In virtually all FFT libraries the degrees of freedom within a breakdown rule are chosen according to some static strategy. However, the strategy itself is inseparable from the implementation. The choice of breakdown rules is typically handled by cascading IF statements, which encode the static strategy.

Notable exceptions are FFTW [61] and UHFFT [84]. These libraries make the choices for the degrees of freedom based on runtime feedback. Both libraries build up a plan before computing the transform. Among other things, the plan stores the necessary buffers and precomputed constants.

The handling of initialization code depends on whether the library is adaptive or not. In adaptive libraries, initialization uses the plan to figure out what to precompute and how many buffers are needed; the plan also stores the initialization data. In adaptive libraries the initialization is most conveniently expressed recursively. In non-adaptive libraries there is no universal standard.

Our approach. We unify the adaptive and non-adaptive library targets by using a plan-like recursive data structure in both cases, and by keeping the choice of values for the degrees of freedom orthogonal to other issues. In adaptive libraries the plan can be built up dynamically, and in fixed libraries the plans may be fixed or use a hard-coded strategy. Besides the recursion strategy, the plan also serves as a convenient place to store initialization data and to allocate the temporary memory buffers used in the computation.
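A sketch of such a plan node (illustrative C++; the field names are hypothetical, not the generator's actual data structure): each node records the chosen breakdown rule and its degrees of freedom, owns its precomputed data and temporary buffer, and points to the plans of its children.

#include <vector>
#include <complex>

// Illustrative sketch of a recursive plan node.
struct PlanNode {
    int rule;                                       // chosen breakdown rule
    std::vector<int> freedoms;                      // values of the degrees of freedom
    std::vector<std::complex<double>> precomputed;  // e.g., twiddle factors
    std::vector<double> buffer;                     // temporary buffer for this step
    std::vector<PlanNode*> children;                // plans of the recursion steps called
};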


5.2 Recursion Step Semantics as Higher-Order Functions

The descriptor system. Ideally, the recursion step closure should translate into a set of mutually recursive functions. In reality, however, things are more complicated.

Mathematically, the semantics of a parametrized recursion step is a function of the form

(u_1 × · · · × u_n × X) → Y.

For example, the semantics of the simple recursion step DFT_{u1} is a function

(u_1 × X) → Y,   u_1 ∈ N,   X, Y ∈ C^{u_1}.

However, it is desirable that some of the parameters, for example the transform size, are provided earlier in order to precompute some constants used in the algorithm. Precomputation amortizes the cost of expensive trigonometric computations when transforms of the same length are computed several times. This is standard practice, used in the Intel IPP library, FFTW, and others. Computing the 32-point DFT using IPP, for example, looks as follows:

IppsDFTSpec_C_64fc *f;
ippsDFTInitAlloc_C_64fc(&f, 32, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone);
ippsDFTFwd_CToC_64fc(X, Y, f, NULL);

The first two lines initialize the so-called "DFTSpec" structure f, setting the transform size to 32 and selecting the unscaled transform via IPP_FFT_NODIV_BY_ANY. The third line computes the transform of X and writes the result into Y. Casting this back into the recursion step framework, the parameters of the recursion step become available at different times.

This temporal discontinuity can be accommodated by currying the semantics function, as in

u_1 → (X → Y).

In the general case, both the inner and outer functions have several parameters. We call the parameters of the outer function "cold" (above, u_1 is cold) and the parameters of the inner function "hot" (above, X is hot). Thus we have:

coldparams → (hotparams → Y),   params = coldparams ∪ hotparams.

This is the second occasion on which higher-order functions (functions returning functions) naturally come into play. The first use of higher-order functions is in parametrizing the generating functions of the ∑-SPL diag(·) construct, as described in Section 3.3.3.

Clearly, many target languages, such as C or C++, do not directly support higher-order functions. Our strategy is to generate intermediate code that uses higher-order functions, and then eliminate them using the process known as closure conversion¹ or lambda lifting [63, 71]. As a result, the descriptor system used in Intel IPP and FFTW naturally arises.

In object-oriented languages closures can be expressed with objects (and vice versa [63, 73]). For example, the function f : u_1 → (X → Y) can be modeled as an object of the following C++ class:

¹"Closure" in this context refers to the standard computer science term that describes a function together with an environment (a set of variable bindings) that must be used for its evaluation [132]. It is in no way related to the recursion step closure, which is a mathematical set closure [133].


class func {
    int u1;
public:
    func(int u1) {
        this->u1 = u1;
    }
    void compute(double *X, double *Y) {
        /* ... u1 can be used ... */
    }
};

Calling such a function is a two-step process, identical to the descriptor system of many transform libraries:

func f(4);

f.compute(X, Y);

Summary. Having explained how higher-order functions naturally come into play, we briefly summarize what this means for the compilation:

• Since ∑-SPL recursion steps are mutually recursive, we have a set of mutually recursive functions.

• Due to the desired temporal discontinuity, all of these functions are actually higher-order.

• If the target language does not directly support higher-order functions, they must be implemented with the help of lexical closures.

• The process is called closure conversion or λ-lifting [63, 71]. We have already used λ-lifting in Chapter 3 for a different purpose, namely to design index-free ∑-SPL.

• Our target is C++; since closures and objects are equivalent [73], closures will be represented as objects in C++ (see the sketch below).
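The conversion itself is mechanical. The following generic sketch (not generator output) shows a higher-order function being lifted into a class: the captured value becomes a field and the returned function becomes a method.

#include <functional>

// Higher-order version: u1 -> (X -> Y), using a C++ lambda.
std::function<double(double)> make_scaler(double u1) {
    return [u1](double x) { return u1 * x; };  // u1 is captured in a closure
}

// After closure conversion / lambda lifting: the captured u1 becomes a
// field, and the function body becomes the compute method.
class Scaler {
    double u1;
public:
    explicit Scaler(double u1) : u1(u1) {}
    double compute(double x) const { return u1 * x; }
};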

Overall flow. We show the overall flow of the Library Implementation block in Fig. 5.1, and explain the individual steps next.

5.3 Library Plan

The first step in the library implementation process is the building of a so-called library plan. The library plan is a data structure that stores a more detailed representation of the recursion step closure and provides a full and clear description of the library structure.

Recall the recursion step closure (3.15) for the complex DFT based on (3.1), which we repeat below (without using ∗'s):

DFT_{u1} --Closure--> {
    DFT_{u1},
    S(h^{u1→u2}_{u3, 1}) DFT_{u1} G(h^{u1→u6}_{u7, u8}),
    S(h^{u1→u2}_{u3, u4}) DFT_{u1} diag(pre(u7^{u1→C})) G(h^{u1→u9}_{u10, u11}),
    S(h^{u1→u2}_{u3, 1}) DFT_{u1} diag(pre(u6^{u1→C})) G(h^{u1→u8}_{u9, u10})
}   (5.1)

After processing (5.1) we obtain the library plan shown in Table 5.1.


RS 1
Formula: DFT_{u1}
Parameters: u1
Children: RS 2, RS 3
Breakdown 1: DFT-LibBase; Applicable: u1 = 2; Descend: DFT_2
Breakdown 2: DFT-CTA; Applicable: isPrime(u1) = 0; Freedoms: f1 ∈ divisors(u1); Descend:
    ∑_{i=0}^{u1/f1−1} RS3((u2, u1, u3, u4, u7^{Z→C}, u9, u10, u11) ← (u1, f1, i, u1/f1, Ω¹_{u1} ∘ h^{f1→u1}_{i, u1/f1}, u1, i, u1/f1)) ·
    ∑_{j=0}^{f1−1} RS2((u2, u1, u3, u6, u7, u8) ← (u1, u1/f1, u1·j/f1, u1, j, f1))

RS 2
Formula: S(h^{u1→u2}_{u3, 1}) DFT_{u1} G(h^{u1→u6}_{u7, u8})
Parameters: u1, u2, u3, u6, u7, u8
Children: RS 2, RS 3
Breakdown 1: DFT-LibBase; Applicable: u1 = 2; Descend: S(h^{2→u2}_{u3, 1}) DFT_2 G(h^{2→u6}_{u7, u8})
Breakdown 2: DFT-CTA; Applicable: isPrime(u1) = 0; Freedoms: f1 ∈ divisors(u1); Descend:
    ∑_{i=0}^{u1/f1−1} RS3((u2, u1, u3, u4, u7^{Z→C}, u9, u10, u11) ← (u2, f1, u3 + i, u1/f1, Ω¹_{u1} ∘ h^{f1→u1}_{i, u1/f1}, u1, i, u1/f1)) ·
    ∑_{j=0}^{f1−1} RS2((u2, u1, u3, u6, u7, u8) ← (u1, u1/f1, u1·j/f1, u6, u7 + u8·j, u8·f1))

RS 3
Formula: S(h^{u1→u2}_{u3, u4}) DFT_{u1} diag(pre(u7^{u1→C})) G(h^{u1→u9}_{u10, u11})
Parameters: u1, u2, u3, u4, u9, u10, u11, u7^{Z→C}
Children: RS 3, RS 4
Breakdown 1: DFT-LibBase; Applicable: u1 = 2; Descend: S(h^{2→u2}_{u3, u4}) DFT_2 diag(pre(u7^{2→C})) G(h^{2→u9}_{u10, u11})
Breakdown 2: DFT-CTA; Applicable: isPrime(u1) = 0; Freedoms: f1 ∈ divisors(u1); Descend: not shown

RS 4
Formula: S(h^{u1→u2}_{u3, 1}) DFT_{u1} diag(pre(u6^{u1→C})) G(h^{u1→u8}_{u9, u10})
Parameters: u1, u2, u3, u8, u9, u10, u6^{Z→C}
Children: RS 3, RS 4
Breakdown 1: DFT-LibBase; Applicable: u1 = 2; Descend: S(h^{2→u2}_{u3, 1}) DFT_2 diag(pre(u6^{2→C})) G(h^{2→u8}_{u9, u10})
Breakdown 2: DFT-CTA; Applicable: isPrime(u1) = 0; Freedoms: f1 ∈ divisors(u1); Descend: not shown

Table 5.1: Library plan constructed from the recursion step closure in (5.1).


[Flow diagram: recursion step closure and ∑-SPL implementations → Build library plan → Hot/cold partitioning → Code generation → target language implementation.]

Figure 5.1: Library generation: "Library Implementation". Input: recursion step closure and ∑-SPL implementations. Output: library implementation in the target language.

The library plan contains an entry for each recursion step in the closure. Each recursion step is given a name; in our implementation the names are simply RS 1 – RS n. Each entry contains the fields below.

• Formula gives a parametrized ∑-SPL formula for the recursion step;

• Parameters lists all parameters of the recursion step formula;

• Children is a list of all other recursion steps that are invoked in the breakdown rule implementations of the current recursion step;

• Parents is a list of all other recursion steps that invoke the current recursion step;

• Breakdown rules is a list of breakdown rules used for implementing (descending into) the current recursion step.

For each breakdown rule, the library plan stores the applicability condition (as a function of the recursion step parameters), the degrees of freedom, and the ∑-SPL implementation obtained by applying the breakdown rule (descending). A breakdown rule implementation can invoke other recursion steps; in the library plan such calls are made explicit by providing the name of the recursion step being called and the list of parameter substitutions, as explained further below.

The recursion step call RS_i((u1, u2) ← (a, b)) is semantically equivalent to the ∑-SPL formula for RS_i with a substituted for u1 and b for u2.

A recursion step invocation from a ∑-SPL implementation is not a simple function call, because the cold parameters must be provided in advance, as discussed earlier. Intuitively, invocation works as follows. The cold parameters are provided for the root recursion step (RS 1). For each recursion step call in RS 1, all cold parameters of the callees are initialized, and the callees initialize their own callees. This results in a recursive initialization process.
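An illustrative sketch of this recursive initialization (hypothetical C++; the generated constructors in Section 5.5 follow this pattern): constructing the root descriptor with its cold parameters recursively constructs the descriptors of all callees.

// Illustrative sketch (not generated code): cold parameters flow down
// the call graph at construction time.
struct Child {
    int size;  // cold parameter derived by the caller
    explicit Child(int size) : size(size) { /* precompute constants here */ }
};

struct Root {
    int u1;    // cold parameter supplied by the user
    Child left, right;
    explicit Root(int u1)
        : u1(u1),
          left(u1 / 2),  // callees' cold parameters are initialized here
          right(2) {}
};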


5.4 Hot/Cold Partitioning

At first glance, it seems that choosing whether the parameters are cold or hot should be left to the user. This is indeed possible for the simplest recursion steps, which have very few parameters, for example DFT_{u1} with a single size parameter, where we clearly want it to be cold to enable precomputation. However, libraries consist of many recursion steps, and even with a library plan as simple as Table 5.1, the choice of hot/cold for the other recursion steps (e.g., RS 2, RS 3 and RS 4) is non-trivial.

In this section we explain how the parameters can be automatically partitioned into hot and cold by taking into account the library requirements and a user-defined policy.

We define the following requirements:

• All precompute markers must be honored. This means that parameters that affect the precomputed data are required to be cold. For example, u1 and u7 in RS 3, and u1 and u6 in RS 4 of Table 5.1.

• The choice of a breakdown rule and any degrees of freedom must be resolved at initialization time. This requires that parameters that affect breakdown rule applicability and the degrees of freedom also be cold. For example, u1 in RS 1–4 of Table 5.1.

• Parameters that change in a loop must be hot. If a recursion step is called from within a loop, some of the parameters of the recursion step will change with the loop counter. These parameters of the callee must be hot, because the loop cannot be executed at initialization time. For example, u3, u7, u10 in RS 3 in Table 5.1.

The parameters that are not affected by any of the requirements above can be either hot or cold, with the constraint that cold parameters cannot depend on hot parameters. We propose two different policies, "as hot as possible" (AHAP) and "as cold as possible" (ACAP), for dealing with such parameters. Later, we will discuss the advantages and drawbacks of these policies.

The process of partitioning is equivalent to an iterative data flow analysis (IDA) [2] performed on a special form of a control flow graph, which we call a parameter flow graph. The nodes of the parameter flow graph are the individual parameters of each recursion step, and the edges are the dependencies between the parameters. The parameter flow graph is obtained from the library plan, annotated with the data flow information. To start the IDA we need the initial state of each node, the data flow information, and a few simple propagation rules. We now explain how this works in detail.

Parameter flow graph. To propagate the cold/hot information we need the equivalent of a control flow graph. Instead of using the recursion steps as the nodes of the control flow graph, we make the individual parameters of the recursion steps the nodes. Hence, we call the graph the parameter flow graph.

If one recursion step (caller) calls another recursion step (callee), then the parameters of the callee are computed from the parameters of the caller. For each such caller/callee parameter pair we add a dependence edge to the graph. The dependence pairs can easily be read off the recursion step invocations in the "descend" fields of the library plan records. For example, inspecting the DFT-CTA breakdown rule descend in RS 1 in Table 5.1 shows the following variable bindings in the RS 3 invocation:

(u2, u1, u3, u4, u7^{Z→C}, u9, u10, u11) ← (u1, f1, i, u1/f1, Ω¹_{u1} ∘ h^{f1→u1}_{i, u1/f1}, u1, i, u1/f1).


This leads to the following dependence edges:

u1 → u2,   u1 → u3,   u1 → u4,   u1 → u7,   u1 → u9,   u1 → u10,   u1 → u11.

Above, the left-hand side refers to the parameter u1 of the caller (RS 1) and the right-hand side to the parameters of the callee (RS 3). Combining all of the dependence edges leads to the parameter flow graph shown in Fig. 5.2.

The edges u1 → u3 and u1 → u10 are not obvious. From the variable bindings we see that both u3 and u10 are computed from the loop index variable i, and since i's range as the loop variable is bounded by u1/f1, we consider them dependent on u1.

Algorithm. The automatic hot/cold partitioning is a two-phase IDA algorithm on the parameter flow graph. The input is the parameter flow graph, and the output is the assignment of a state to each parameter (node). The nodes of the graph can be in four states: "none", "cold", "hot" and "reinit". The algorithm initializes each node to the "none" state, which means that no decision has been made. Then we apply two IDA phases. The first phase is a backward IDA that marks all mandatory cold parameters. The second phase is a forward IDA that marks all mandatory hot parameters and, in the case that a parameter must be cold and hot at the same time, puts it in the special "reinit" state, explained below. Finally, when the second IDA phase is complete, we mark all unmarked parameters (i.e., those in the "none" state) as either hot or cold, depending on whether the AHAP or the ACAP policy is used.

It can happen that a parameter depends on the loop index, and thus must change as the loop index runs, while at the same time the parameter is inside the precompute marker. An example of such a parameter is u7 in RS 3: it is the generator function for the twiddle diagonal. In this case we assign this parameter the "reinit" status, which means that the caller of the environment RS 3 will create several copies of the RS 3 descriptor with different values of u7. The number of copies will be equal to the number of iterations of the loop from which RS 3 is invoked.

Initial state assignment. The initial state assignment is done according to the rules below:

1. Mark all freedom parameters f_i as "cold".

2. Mark all parameters that appear in the "applicable" and/or "freedoms" fields of the breakdown rules in the library plan as "cold". Examples of such parameters are u1 in all recursion steps of the library plan in Table 5.1.

3. Mark all parameters that appear inside the pre(·) marker as "cold". Examples of such parameters are u7 and u1 in RS 3, and u6 and u1 in RS 4 in Table 5.1.

4. Put all other parameter nodes into “none” state.

The initial assignment makes sure that the first two library requirements explained earlier are satisfied. Next, the phase 1 IDA propagates the "coldness" backwards.


[Figure content, three panels over the parameters of RS 1–RS 4: (a) the original unpartitioned parameter flow graph; (b) the graph after initial state assignment and phase 1 IDA (phase 1 does not change any states); (c) the graph after phase 2 IDA.]

Figure 5.2: Parameter flow graph obtained from the library plan in Table 5.1, at various stages of the automatic hot/cold partitioning algorithm. Node color and shape denote state: white square = "none", black square = "cold", gray square = "hot", gray octagon = "reinit".


Phase 1: Backward IDA. The goal of this phase is to find all parameters that must be cold. A cold parameter is computed at initialization time. If its value depends on other parameters, then those parameters must also be known at initialization time, i.e., they must be cold.

This is done by performing a backward IDA with one simple rule: for each dependence edge p → q, if q is marked "cold" and p is marked "none", then mark p as "cold". The rule is applied until a fixed point is reached.

After the initial assignment, the parameter flow graph in Fig. 5.2 contains no such dependence edges. There are only edges with both nodes already cold, for example from u1 of RS 1 to u1 of RS 2.

Phase 2: Forward IDA. The goal of this phase is to fulfill the third library requirement, i.e., to make the parameters that depend on loop indices hot. Similarly to phase 1, we need to propagate the "hotness" along the dependence edges, but now forward instead of backward. Clearly, a hot parameter cannot be used to initialize a cold parameter; thus the "hotness" is propagated forward. Another caveat is that a parameter that depends on a loop index can already be marked "cold". In this case, we change the state to "reinit", which means that the enclosing environment is replicated to cover all possible values of such a parameter. The set of possible values is bounded by the number of iterations of the loop. For all other purposes "reinit" parameters are treated as "cold", and thus no further propagation is necessary.

Phase 2 is defined by the following three IDA rules:

1. If parameter p is marked “none”, and depends on a loop index, mark p “hot”.

2. If parameter p is marked “cold”, and depends on a loop index, mark p “reinit”.

3. For each dependence edge p → q, if p is marked "hot" and q is marked "none", then mark q as "hot".

Note that dependence edges p → q with p "hot" and q "cold" are not possible, because phase 1 propagates coldness backwards, and any edge p → q with q cold results in p being marked "cold" as well. Also, if p is marked "cold" and depends on the loop index, it is marked "reinit", which is not propagated further.
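A compact sketch of the two phases (illustrative C++ over an adjacency-list parameter flow graph; assumes the node states were initialized per the rules above):

#include <vector>

enum State { NONE, COLD, HOT, REINIT };

struct Graph {
    int n;                              // number of parameter nodes
    std::vector<std::vector<int>> succ; // dependence edges p -> q
    std::vector<std::vector<int>> pred; // reverse edges
    std::vector<bool> loop_dep;         // node depends on a loop index?
    std::vector<State> state;           // initial states per the rules above
};

void partition(Graph &g) {
    // Phase 1: propagate "cold" backwards until a fixed point.
    bool changed = true;
    while (changed) {
        changed = false;
        for (int q = 0; q < g.n; q++)
            if (g.state[q] == COLD)
                for (int p : g.pred[q])
                    if (g.state[p] == NONE) { g.state[p] = COLD; changed = true; }
    }
    // Phase 2: mark loop-dependent nodes, then propagate "hot" forwards.
    for (int p = 0; p < g.n; p++)
        if (g.loop_dep[p])
            g.state[p] = (g.state[p] == COLD) ? REINIT : HOT;
    changed = true;
    while (changed) {
        changed = false;
        for (int p = 0; p < g.n; p++)
            if (g.state[p] == HOT)
                for (int q : g.succ[p])
                    if (g.state[q] == NONE) { g.state[q] = HOT; changed = true; }
    }
    // Remaining NONE nodes become COLD under ACAP or HOT under AHAP.
}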

ACAP vs. AHAP policy. After the phase 2 IDA is complete, some nodes may still be marked "none". Under the ACAP policy these nodes are marked "cold", and under AHAP they are marked "hot".

The ACAP policy exposes most parameters at initialization time. The values of these parameters form important context information, which enables an accurate search over the degrees of freedom. This search happens at initialization time. With the AHAP policy some parameters are not yet known at initialization time, and thus the search must assume some default values for them. Examples of parameters that can drastically change performance are the input and output strides. With the AHAP policy the search becomes context insensitive and thus less accurate.

The AHAP policy, on the other hand, is useful when it is desired to change some parameters (for example, input and output data strides) without reinitializing. In addition, the AHAP policy also minimizes the size of the descriptor, since fewer parameters need to be stored after initialization.

For libraries with a small number of parameters, there may be no difference between the interfaces of ACAP and AHAP libraries, but the choice of policy is still important. For example, in the library plan in Table 5.1 the user-requested recursion step is RS 1, which has only one parameter (the DFT size, u1). Under both policies u1 is necessarily cold, and thus the interface of RS 1 does not change


depending on the hot/cold policy chosen. However, the two libraries are still different, because the recursion steps used for implementing RS 1 will have different hot/cold partitions, based on the policy. In this case, the ACAP policy enables a more accurate runtime search over the degrees of freedom, and the AHAP policy minimizes the descriptor size.

Other policies. It is possible to implement other policies for marking the remaining nodes. However, any node partition must satisfy the timing requirement, namely that a hot parameter cannot be used to initialize a cold parameter. Formally, for each dependence edge p → q, if p is hot, then q has to be hot.

Suppose that the ACAP policy is desired, but one specific parameter p is desired to be hot. One way to implement this is to perform the normal partitioning. If p is marked "cold" or "reinit", then signal an error. If p is marked "hot", then mark all other nodes cold (according to the ACAP policy). Finally, if p is marked "none", mark it "hot", repeat the phase 2 IDA, and mark the remaining "none" nodes as "cold".

5.5 Code Generation

The final stage is to generate a C++ class for each recursion step in the library plan. Each C++ class has a constructor and a compute method. The constructor performs the initialization tasks and saves the values of the cold parameters. The compute method takes the hot parameters as arguments and uses the saved values of the cold parameters.

As we mentioned earlier, these classes are the equivalents of the "descriptors" in regular transform libraries. Semantically, they serve the purpose of lexical closures, which in turn implement higher-order functions.

Non-adaptive library example. In Table 5.2 we show a code fragment of a generated non-adaptive DFT library based on the recursion step closure in (5.1) and the library plan in Table 5.1. We show the C++ class generated for RS 1 (called Env_1, where "Env" stands for environment). RS 1 corresponds to the ∑-SPL formula DFT_{u1}, with only one parameter u1, which is the size of the DFT. (We omit the Env_1 destructor and auxiliary classes in the listing.)

In Table 5.2 the constructor Env_1::Env_1 (line 9) and Env_1::compute (line 36) are formed by combining multiple alternative ∑-SPL implementations. The applicability conditions of the alternative implementations are checked in the constructor, and the first implementation whose applicability condition is true is selected, saving the result in the class attribute _rule. In our case the constructor chooses among three alternative implementations: the base case for the DFT of size 2 (with condition u1 = 2, line 11), the base case for the DFT of size 4 (u1 = 4, line 15), and the recursive implementation for other sizes (line 19), which must be non-prime.

If the recursive implementation is selected, the child recursion steps must be created and initialized. As we can see from the library plan in Table 5.1 (and the call graph in Fig. 3.4(b)), RS 1 has two children, RS 2 and RS 3, which correspond to the classes Env_2 and Env_3. The initialization of RS 3 in lines 24–28 creates multiple copies due to the "reinit" parameter u7 (the diagonal generating function, which changes with the loop index). This generating function is also created at runtime, and thus is a higher-order function (just like the DFT), and thus also requires a lexical closure, i.e., a class (here it is Func_1).

In this case the recursive implementation has a degree of freedom f1, and its value is determined in line 20 using the heuristic chooseDivisor, without any platform adaptation.

Env_1::compute dispatches to the right implementation based on the value of _rule, which was set during initialization in the constructor. Lines 51–58 implement the recursion by recursively calling the compute methods of the child recursion steps.

This library is missing two important optimizations. First, the buffer allocated in the computation (in Env_1::compute) can be eliminated. Second, the recursion steps do not contain loops, and as a consequence there is no loop in the base case. Both of these optimizations are implemented in our generator, as we explained in Chapter 3.

Using the generated library. Using the generated library is very simple; we show a usage example below:

Env_1 dft(1024);

dft.compute(Y, X);

The above fragment computes the DFT of X and stores the result in Y.

Adaptive libraries. In order for the library to be adaptive, it must automatically choose the value of f1. It turns out that this is rather easy to incorporate into the generated libraries by adding several extra methods to the classes, while preserving the overall code structure of Table 5.2. We will not show an example here.


     1  class Env_1 : public Env {
     2      int u1, _rule, f1;
     3      char *_dat;                    // for precomputed data
     4      Env *child1, *child2;
     5      Env_1(int u1);
     6      void compute(double *Y, double *X);
     7  };
     8
     9  Env_1::Env_1(int u1) {
    10      this->u1 = u1;
    11      if (((u1) == (2))) {
    12          _dat = (char *) NULL;
    13          _rule = 1;
    14      }
    15      else if (((u1) == (4))) {
    16          _dat = (char *) NULL;
    17          _rule = 2;
    18      }
    19      else if (((isPrime(u1)) == (0))) {
    20          f1 = chooseDivisor(u1);
    21
    22          // need multiple versions with different constants,
    23          // generated using different instances of Func_1
    24          child1 = new Env_3[(u1/f1)];
    25          for (int i40 = 0; i40 < (u1/f1); i40++)
    26              *(cast<Env_3 *>(child1) + i40) = Env_3(f1,
    27                  new Func_1(i40, f1, (u1/f1), u1, f1, (u1/f1)));
    28
    29          child2 = new Env_2((u1/f1));
    30          _dat = (char *) NULL;
    31          _rule = 3;
    32      }
    33      else error("no applicable rules");
    34  }
    35
    36  void Env_1::compute(double *Y, double *X) {
    37      if (((_rule) == (1))) {        // Base case, DFT_2
    38          double a902, a903, a904, a905;
    39          a902 = *(X);
    40          a903 = *((2 + X));
    41          a904 = *((1 + X));
    42          a905 = *((3 + X));
    43          *(Y) = (a902 + a903);
    44          *((1 + Y)) = (a904 + a905);
    45          *((2 + Y)) = (a902 - a903);
    46          *((3 + Y)) = (a904 - a905);
    47      }
    48      else if (((_rule) == (2))) {   // Base case, DFT_4
    49          // .. skipped ..
    50      }
    51      else if (((_rule) == (3))) {   // Recursion
    52          double *T35;               // unnecessary buffer
    53          T35 = LIB_MALLOC(sizeof(double) * (2 * u1));
    54          for (int i41 = 0; i41 < f1; i41++)
    55              cast<Env_2 *>(child2)->compute(T35, X, i41, u1, f1, ((u1*i41)/f1), u1);
    56          for (int i40 = 0; i40 < (u1/f1); i40++)
    57              (cast<Env_3 *>(child1) + i40)->compute(Y, T35, i40, u1, (u1/f1), i40, u1, (u1/f1));
    58          LIB_FREE(T35, sizeof(double) * (2 * u1));
    59      }
    60      else error("no applicable rules");
    61  }

Table 5.2: Fragment of the automatically generated code for DFT_{u1} (with our comments added), based on the recursion step closure in Fig. 3.4 and the library plan in Table 5.1. Func_1(...) creates a generating function for the diagonal elements and is passed down to the child recursion step (RS 3), which will use it to generate the constants. Since each iteration of the loop uses a different portion of the constants, several copies of Env_3 are created. Such copying is always required if a child recursion step has "reinit" parameters (u7 in RS 3 in this case).


Chapter 6

Experimental Results

In this chapter we perform a thorough performance evaluation of the library generator. We compare the generated libraries to the best hand-written implementations, including the appropriate vendor libraries. We also discuss additional issues such as handling higher-dimensional transforms, the efficiency of vectorization and parallelization, and the flexibility of the generator.

6.1 Overview and Setup

Platforms. We ran our experiments on three different platforms, described in detail below:

• A 3 GHz Intel Xeon 5160 (server version of the Core 2 Duo) based computer. This machine has 2 dual-core processors (a total of 4 processor cores) with 64 KB of private L1 cache per core and 4 MB of shared L2 cache per socket.

• A 2.8 GHz AMD Opteron 2220 based computer. This machine also has 2 dual-core processors (a total of 4 processor cores) with 256 KB of private L1 cache per core, 1 MB of private L2 cache per core (2 MB per socket), and a dedicated communication channel between the cores within a pair.

• A 3 GHz Intel Core 2 Extreme QX9650 (code name "Penryn") based computer. This is the latest Intel offering. The machine has 2 dual-core processors (a total of 4 processor cores) with 64 KB of private L1 cache per core and 6 MB of shared L2 cache per socket. Since it was released close to the completion of this thesis, we have not benchmarked exhaustively on this machine and provide only a small number of benchmarks.

All three computers were running Linux in 64-bit mode. All generated libraries were in C++, compiled using the Intel C/C++ Compiler 10.1. Vectorized code was emitted using SSE and SSE2 intrinsics, and threading was implemented with OpenMP pragmas with explicit barriers.

Transforms. We generated general-size libraries for the following transforms:

• Discrete Fourier transform (DFT);

• Walsh Hadamard transform (WHT) [17,18];

• Real discrete Fourier transform (RDFT, also known as the DFT of real data, or real-valued DFT) [20, 98, 112, 125];


• Discrete Hartley transform (DHT) [31,32,98,125,126];

• DCT-2, DCT-3 and DCT-4 [33,96,97,103];

• FIR filter [62,93];

• Downsampled FIR filter (building block of the wavelet transform) [62,113,124];

• 2-dimensional DFT and DCT-2.

Hand-written libraries. We compared against FFTW 3.2 alpha¹, Intel IPP 5.2, and also AMD APL 1.1 on the AMD platform. We chose FFTW 3.2 alpha since it provides improved threading support and additional performance improvements. FFTW was compiled with pthreads rather than OpenMP (which our generated libraries use), since OpenMP resulted in some performance degradation for FFTW.

We did not benchmark AMD APL on the Intel platforms; however, because AMD APL implements only the DFT and RDFT, we did benchmark Intel IPP on the AMD platform.

Intel IPP 5.2 does not provide threading for the DFT, RDFT, and DCTs. After we performed most of the benchmarks, we learned that version 5.3 of IPP had become available, which does include threading for these transforms and in addition is slightly faster for single-threaded code. Unfortunately, we did not have enough time to benchmark the new IPP version for this thesis.

Generated libraries. We generated single- and multi-threaded libraries with scalar, 2-way vectorized, and 4-way vectorized code. Table 6.1 shows a subset of the breakdown rules from which the trigonometric transform libraries were generated.

For the FIR filters we used only the time-domain breakdown rules (not shown), which compute the filter by definition (i.e., by performing the direct matrix-vector product) and block the filter matrix in various ways. Spiral also contains frequency-domain FIR filter rules, which use the DFT, but we did not generate libraries from these rules for the reason explained next.

One current limitation of the library generator is that it cannot generate code for breakdown rules that must compute the DFT or some other transform as part of the initialization. Examples of such rules include the frequency-domain filtering rules, and the Rader [102] and Bluestein [27] rules for the DFT. This limitation is only due to lack of time and can be easily fixed in the future. It only affects the recursion; the rules can still be used inside the small fixed-size base cases, which are generated using standard fixed-size Spiral capabilities.

To constrain the number of benchmarks, for the trigonometric transforms we generated libraries that support 2-power sizes only. Libraries for larger classes of sizes can also be generated, as demonstrated in Fig. 6.5. Since the Rader and Bluestein FFT algorithms currently cannot be compiled as recursions, libraries that support sizes with large prime factors cannot yet be generated.

Table 6.2 shows the number of recursion steps in our generated libraries. All libraries use recursion steps that include loops, as explained in Section 3.6. This ensures better performance and enables parallelism optimizations. However, it increases the number of recursion steps, since for each recursion step with a loop there is a corresponding unlooped recursion step (the loop body). The base case can be generated for either variant, but generating a base case for the looped recursion step leads to better performance.

In Table 6.2, the scalar DFT library uses the Cooley-Tukey FFT breakdown rule (2.1), to match our running example. The vectorized DFT libraries use breakdown rule (6.1) from Table 6.1. Rule (6.1) is different from (2.1) and requires fewer vector shuffles. It is based on the algorithm from [125].

¹FFTW is partially generated. The library has hand-written recursion and uses fully unrolled, automatically generated code (codelets) for the base cases, i.e., small-size transforms.


DFT_n = P^T_{k/2,2m} ( DFT_{2m} ⊕ ( I_{k/2-1} ⊗_i C_{2m} rDFT_{2m}(i/k) ) ) ( RDFT'_k ⊗ I_m ),   k even,   (6.1)

[RDFT_n; RDFT'_n; DHT_n; DHT'_n] = (P^T_{k/2,m} ⊗ I_2) ( [RDFT_{2m}; RDFT'_{2m}; DHT_{2m}; DHT'_{2m}] ⊕ ( I_{k/2-1} ⊗_i D_{2m} [rDFT_{2m}(i/k); rDFT_{2m}(i/k); rDHT_{2m}(i/k); rDHT_{2m}(i/k)] ) ) ( [RDFT'_k; RDFT_k; DHT_k; DHT_k] ⊗ I_m ),   k even,   (6.2)

[rDFT_{2n}(u); rDHT_{2n}(u)] = L^{2n}_m ( I_k ⊗_i [rDFT_{2m}((i+u)/k); rDHT_{2m}((i+u)/k)] ) ( [rDFT_{2k}(u); rDHT_{2k}(u)] ⊗ I_m ),

RDFT-3_n = (Q^T_{k/2,m} ⊗ I_2) ( I_k ⊗_i rDFT_{2m}((i + 1/2)/k) ) ( RDFT-3_k ⊗ I_m ),   k even,

DCT-2_n = P^T_{k/2,2m} ( DCT-2_{2m} K^{2m}_2 ⊕ ( I_{k/2-1} ⊗ N_{2m} RDFT-3^T_{2m} ) ) B_n ( L^{n/2}_{k/2} ⊗ I_2 )( I_m ⊗ RDFT_k ) Q_{m/2,k},

DCT-3_n = DCT-2^T_n,

DCT-4_n = Q^T_{k/2,2m} ( I_{k/2} ⊗ N_{2m} RDFT-3^T_{2m} ) B'_n ( L^{n/2}_{k/2} ⊗ I_2 )( I_m ⊗ RDFT-3_k ) Q_{m/2,k}.

Table 6.1: Breakdown rules for a variety of transforms: the DFT for real input (RDFT), the discrete Hartley transform (DHT), and the discrete cosine transforms (DCTs) of types 2–4. The others are auxiliary transforms needed to compute them. P, Q are permutation matrices, T are diagonal matrices, and B, C, D, N are other sparse matrices. The first and second rules are, respectively, for four and two transforms simultaneously; the bracketed stacks are read entrywise, i.e., the j-th entry of each stack yields the rule for the j-th transform.

                               Number of recursion steps
    Transform                  scalar     vectorized    vectorized + parallelized
    DFT                        4 / 3      4 / 7         8 / 8
    RDFT                       4 / 6      10 / 10       12 / 10
    DHT                        4 / 6      10 / 10       12 / 10
    DCT-2                      5 / 9      11 / 13       13 / 13
    DCT-3                      5 / 9      12 / 16       14 / 16
    DCT-4                      4 / 4      6 / 4         8 / 4
    WHT                        4 / 3      6 / 4         7 / 4
    2D DCT                     10 / 14    12 / 13       14 / 13
    2D DFT                     7 / 9      11 / 13       13 / 13
    FIR Filter                 4 / 4      4 / 5         4 / 4
    Downsampled FIR Filter     4 / 4      4 / 5         4 / 4

Table 6.2: Number of recursion steps m/n in our generated libraries. m is the number of steps with loops; n is the number without loops, and a close approximation of the number of base cases (codelets) needed for each small input size.


                                         Code size
    Transform                  non-parallelized         parallelized

    No vectorization
    DFT                        13.1 KLOC / 0.59 MB      10.3 KLOC / 0.45 MB
    RDFT                       8.5 KLOC / 0.36 MB       8.8 KLOC / 0.39 MB
    DHT                        9.1 KLOC / 0.40 MB       9.4 KLOC / 0.39 MB
    DCT-2                      12.0 KLOC / 0.55 MB      12.4 KLOC / 0.57 MB
    DCT-3                      12.0 KLOC / 0.56 MB      12.3 KLOC / 0.59 MB
    DCT-4                      6.8 KLOC / 0.33 MB       7.1 KLOC / 0.35 MB
    WHT                        5.6 KLOC / 0.21 MB       —

    2-way vectorization
    DFT                        14.8 KLOC / 0.73 MB      15.0 KLOC / 0.74 MB
    RDFT                       15.6 KLOC / 0.76 MB      16.0 KLOC / 0.81 MB
    scaled RDFT                16.0 KLOC / 0.78 MB      —
    DHT                        16.9 KLOC / 0.83 MB      17.2 KLOC / 0.87 MB
    DCT-2                      20.7 KLOC / 1.10 MB      21.0 KLOC / 1.09 MB
    DCT-3                      27.9 KLOC / 1.56 MB      28.2 KLOC / 1.59 MB
    DCT-4                      7.8 KLOC / 0.47 MB       8.1 KLOC / 0.50 MB
    WHT                        6.9 KLOC / 0.32 MB       5.8 KLOC / 0.26 MB
    FIR Filter                 167 KLOC / 7.75 MB       120 KLOC / 5.12 MB
    Downsampled FIR Filter     100 KLOC / 4.2 MB        68 KLOC / 2.76 MB

    4-way vectorization
    DFT                        17.9 KLOC / 1.09 MB      18.2 KLOC / 1.11 MB
    RDFT                       16.2 KLOC / 0.86 MB      16.5 KLOC / 0.91 MB
    scaled RDFT                16.5 KLOC / 0.88 MB      —
    DHT                        17.9 KLOC / 1.02 MB      18.3 KLOC / 1.04 MB
    DCT-2                      23.3 KLOC / 1.50 MB      23.6 KLOC / 1.53 MB
    DCT-3                      32.0 KLOC / 2.17 MB      32.3 KLOC / 2.20 MB
    DCT-4                      8.3 KLOC / 0.63 MB       8.6 KLOC / 0.66 MB
    WHT                        8.5 KLOC / 0.53 MB       6.9 KLOC / 0.4 MB
    2D DFT                     20.6 KLOC / 1.32 MB      20.8 KLOC / 1.33 MB
    2D DCT-2                   27.0 KLOC / 2.1 MB       27.2 KLOC / 2.11 MB
    FIR Filter                 109 KLOC / 5.69 MB       74 KLOC / 3.44 MB
    Downsampled FIR Filter     151 KLOC / 7.7 MB        92 KLOC / 4.61 MB

Table 6.3: Code size of our generated libraries. KLOC = kilo (thousands of) lines of code. The WHT, scaled, and 2D transforms are not shown in all possible variants.


Table 6.3 shows the code size of different generated libraries.

Search. The generated libraries are partially adaptive and provide a mechanism to search over the degrees of freedom within each breakdown rule at library runtime (during initialization). For example, the library can search for the best radix in the Cooley-Tukey breakdown (2.1), and the best block size in the filter blocking rules. Currently, the library cannot automatically choose the best alternative out of several applicable breakdown rules.

The search mechanism keeps the value of the degree of freedom that yields the best runtime. The best choice is hashed against the recursion step number and its cold parameters. The search infrastructure was implemented by Frederic de Mesmay, and thanks to him the effort needed to produce good runtime results was greatly reduced.
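A sketch of this bookkeeping, assuming a single cold parameter (the size n) and a hypothetical timing routine timeWithRadix; the names are illustrative, not the library's actual code:

    #include <map>
    #include <utility>

    typedef std::pair<int, int> StepKey;        // (recursion step number, size)
    extern double timeWithRadix(int step, int n, int radix);  // hypothetical timer

    static std::map<StepKey, int> bestRadix;    // best choice, hashed by key

    int searchRadix(int step, int n) {
        StepKey key(step, n);
        std::map<StepKey, int>::iterator it = bestRadix.find(key);
        if (it != bestRadix.end())
            return it->second;                  // best choice already cached
        int best = 2; double bestTime = 1e30;
        for (int r = 2; r < n; r *= 2) {        // candidate 2-power divisors
            double t = timeWithRadix(step, n, r);
            if (t < bestTime) { bestTime = t; best = r; }
        }
        bestRadix[key] = best;                  // remember the winner
        return best;
    }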

FFTW also uses runtime performance adaptation to search for the best recursion strategy. The FFTW search mechanism can find the best "solver" (the equivalent of a breakdown rule here) and search over the degrees of freedom within the solver (e.g., the radix in the Cooley-Tukey FFT). As far as we know, neither Intel IPP nor AMD APL uses search or any other form of adaptation.

Sizes. For all transforms except FIR filters, we benchmarked only 2-power sizes (except in Fig. 6.5) smaller than 2^16. We benchmarked the 2-power sizes as these are the most commonly used sizes, and they provide a reasonable way to constrain the space of benchmarks.

We also limit the transform sizes so that the working set fits into the L2 cache of the computer. On the Intel platforms the largest size that fits is about 2^16, and on the AMD platform it is smaller. Larger sizes, due to cache thrashing caused by large strides, require special optimizations, for example, copying the data into a buffer. These large-size optimizations are implemented in standard Spiral as breakdown rules; however, we did not yet port them to be compatible with the library generator.

Due to the lack of the proper algorithm, the performance of the generated libraries for these larger sizes is very poor, and we do not think it makes sense to compare them against properly optimized libraries. As soon as the large-size breakdown rules are ported, we can generate fast out-of-L2-cache libraries.

Performance metrics. To show the performance of generated code, instead of runtime we use pseudo Gflop/s (giga floating point operations per second), which allows us to use the same linear scale for different transform sizes. This is the standard practice in the field. "Pseudo" means that instead of the actual operation count we use a normalized value, so that pseudo Gflop/s are proportional to inverse runtime. Pseudo Gflop/s are computed as follows (a short worked example follows the cost list below):

    pseudo Gflop/s = normalized arithmetic cost / (runtime [sec] · 10^9).

The normalized arithmetic cost is computed as

• 5n log2 n for the 1-dimensional length n DFT;

• 5mn log2 mn for the 2-dimensional m× n DFT;

• 2.5n log2 n for the 1-dimensional length n RDFT, DHT and DCTs of all types;

• 2.5mn log2 mn for the 2-dimensional m× n RDFT, DHT and DCTs of all types;

• n log2 n for the WHT;

Page 133: Library Generation For Linear Transforms

116 CHAPTER 6. EXPERIMENTAL RESULTS

• 2nk for the regular and downsampled FIR filter with n output samples and k taps. The number of input samples is hence n + d(k − 1), where d is the downsampling factor. For a regular (not downsampled) filter, d = 1.
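As a small worked example of this metric, the following sketch computes pseudo Gflop/s for a 1-dimensional DFT of length n, using the normalized cost 5n log2(n) from the list above (the function name is ours):

    #include <cmath>

    double pseudoGflopsDFT(int n, double runtimeSec) {
        double cost = 5.0 * n * std::log((double)n) / std::log(2.0);
        return cost / (runtimeSec * 1e9);
    }
    // E.g., a size-1024 DFT taking 5 microseconds gives
    // 5 * 1024 * 10 / (5e-6 * 1e9) = 10.24 pseudo Gflop/s.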

6.2 Performance of Generated Libraries, Overview

In this section we provide a brief glimpse of the performance of the generated libraries. We demonstrate the wide variety of transforms for which libraries can be generated, show generated libraries in different precisions, and compare scalar (non-vectorized) code performance to FFTW.

The complete set of benchmarks can be found in Appendix A.

6.2.1 Transform Variety and the Common Case Usage Scenario

In the first series of benchmarks we show a variety of useful functionality for which libraries can be generated, and focus on what we believe to be the most common usage scenario. Namely, we assume that the typical user wants to compute a transform in double precision and as fast as possible.

For most users this means that the maximal level of optimization should be used together with the most common platform configuration. This means 2-way vectorization (the maximum for double precision) and up to 2 concurrent threads, since a single dual-core processor is the most common platform today.

In the experiments that follow, we generated 2-way vectorized and threaded libraries which support only 2-power size transforms. We always show one line per library, which shows the best performance between 1 and 2 threads. For all transforms the libraries switch to 2 threads at sizes 1024 or 2048. Intel IPP 5.2 does not support threading for the transforms we consider, and thus its performance is significantly worse for larger (>2048) sizes.

DFT and RDFT. The discrete Fourier transform is probably the most important linear transform in use. Many hand-written optimized libraries exist for the complex DFT and the real-valued DFT (RDFT), and a lot of optimization effort went into maximizing their performance.

Today, fast DFT implementations are available from the hardware vendors, including Intel and AMD. These highly optimized implementations have become the baseline of "good" performance.

We show the breakdown rules used for the generation of the libraries in Table 6.1. For the complex DFT, instead of the Cooley-Tukey FFT (2.1) we use the non-standard recursion based on the RDFT shown in Table 6.1. It provides better performance for vector code and provides good motivation for the use of a library generator. For the real DFT, we use a similar breakdown rule, which is a general-radix RDFT algorithm from [125].

The respective performance plots are shown in Fig. 6.1. The generated DFT and RDFT libraries are slightly faster than the other libraries on the Intel Xeon, and to a lesser extent on the AMD Opteron. The speedup comes from the more efficient DFT and RDFT algorithms (6.1) and (6.2), which minimize the number of vector shuffles in the vectorized implementations. Both of these algorithms suffer from more expensive index computations at sizes beyond 4096 (for example, see Figure A.2 in Appendix A); in this case, however, the index computation overhead is compensated by threading.

On the AMD platform, APL generally has the best single-threaded DFT performance, except for smaller sizes. However, the generated library starts using 2 threads already at size 1024, wiping out the APL advantage completely.

[Figure 6.1: four performance plots. DFT and RDFT, 2 thread(s), 2-way vectorized, double precision; Gflop/s versus length (4 to 64k); platforms: Intel Xeon 5160 (3000 MHz) and AMD Opteron 2220 (2800 MHz); lines: IPP 5.2, APL 1.1 (AMD only), FFTW 3.2alpha2, Generated library.]

Figure 6.1: Generated libraries for the complex and real Fourier transforms (DFT and RDFT). These are the most widely used transforms, and both the vendor libraries and FFTW are highly optimized.

On both platforms, FFTW has suboptimal threading performance, perhaps due to using its own implementation of thread pooling based on pthreads. The generated library benefits from the faster implementation in OpenMP. We did not enable OpenMP in FFTW, since it was slower than pthreads for some unknown reason, as we mentioned in Section 6.1.

DCT-2 and DCT-4. There is a total of 16 types of discrete cosine and sine transforms. Here we benchmark the two most commonly used, the type-2 and type-4 transforms. The type-2 DCT is sometimes called simply the forward DCT. We compare the performance in Fig. 6.2.

The results need a more detailed explanation. On the Intel Xeon, the generated library is on average about 2 times faster than FFTW, and up to 5 times faster than IPP. On the AMD platform the speedup is smaller but also sustained. The divergence between FFTW and IPP at larger sizes is due to the threaded DFT in FFTW. The reason for the big performance gap is twofold.

First, the DCTs are less commonly used than the DFT, and thus fewer resources are spent on their implementation and optimization. For example, not all DCT types are supported by IPP, and APL does not implement DCTs at all.

Second, our generated library uses a different algorithm, which is hard to implement without a full-program generator, as we explain next.

The common methods for computing the DCTs focus on reusing optimized DFT implementations. They involve either a conversion to the real DFT with pre- and post-processing passes [81], or a "radix-2" split into half-size DCTs, for example [37, 127]. The use of these methods is due to the common belief that the DFT (or RDFT) is the fundamental building block on top of which the DCTs must be implemented. However, the explicit unmerged pre- and post-processing stages lead to suboptimal performance due to a number of factors. First, the explicit passes over the array decrease memory reuse and result in memory hierarchy overhead. Second, these extra passes need to be hand-written using SIMD vector instructions for best performance (which means that a different implementation for each vector length is necessary), and as far as we could tell, this was not done in FFTW. Finally, the extra passes limit the possible parallelization speedup, since they are either implemented in serial code or require additional barriers.

[Figure 6.2: four performance plots. DCT2 and DCT4, 2 thread(s), 2-way vectorized, double precision; Gflop/s versus length; platforms: Intel Xeon 5160 (3000 MHz) and AMD Opteron 2220 (2800 MHz); lines: IPP 5.2, APL 1.1 (AMD only), FFTW 3.2alpha2, Generated library.]

Figure 6.2: Generated libraries for the discrete cosine transforms DCT-2 and DCT-4. Less effort is spent on hand-optimizing these transforms, and thus hand-written libraries are either slower or lack the functionality. Neither IPP nor APL provides the DCT-4, and in addition APL does not implement the DCT-2.

Our generated library, on the other hand, employs native general-radix recursive Cooley-Tukey-type algorithms for the DCTs (shown in Table 6.1), which are closely related to the algorithms in [97]. All of the permutations and other sparse matrices are appropriately merged with the adjacent loops, resulting in different recursion steps, and thus different types of base cases (or "codelets" in FFTW terminology) than those occurring in the RDFT algorithm.

Manually implementing such a DCT library based on this algorithm would have to be done from scratch, because very little code can be reused from any existing optimized DFT/RDFT implementation.

[Figure 6.3: four performance plots. DHT and WHT, 2 thread(s), 2-way vectorized, double precision; Gflop/s versus length; platforms: Intel Xeon 5160 (3000 MHz) and AMD Opteron 2220 (2800 MHz); lines: FFTW 3.2alpha2, Generated library.]

Figure 6.3: Generated libraries for the discrete Hartley transform and the Walsh Hadamard transform (DHT and WHT). Less effort is spent on optimizing these transforms, and thus hand-written libraries are much slower. IPP and APL do not provide the DHT and WHT at all.

Discrete Hartley and Walsh Hadamard transforms. We show the results for the DHT and the WHT in Fig. 6.3. Neither of these transforms is provided by IPP or APL, so we only compare against FFTW. FFTW directly supports only the DHT. However, since the WHT of size 2^n is equivalent to an n-dimensional 2 × · · · × 2 real DFT, we can still compute it using FFTW, because FFTW supports DFTs of arbitrary dimensions.
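This equivalence can be written out explicitly: since $\mathrm{DFT}_2$ is a real matrix, the size-$2^n$ WHT is the $n$-fold tensor product

$$\mathrm{WHT}_{2^n} = \underbrace{\mathrm{DFT}_2 \otimes \cdots \otimes \mathrm{DFT}_2}_{n\ \text{factors}}, \qquad \mathrm{DFT}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},$$

which is exactly an $n$-dimensional $2 \times \cdots \times 2$ (real) DFT.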

Again, the generated library outperforms FFTW by a big margin. For example, the WHT on the Intel Xeon is up to 4 times faster in the generated library, and the DHT is up to 1.75 times faster. The reasons for the big difference are similar to those for the DCTs.

For the DHT, we employ the native general-radix algorithm from [125]. The generator performs loop merging, determines the right set of base cases (codelets), and generates the library. FFTW uses a conversion to the real DFT with a post-processing step, which results in performance degradation and also limits the speedup achievable through parallelization.

For the WHT, we use a general-radix splitting algorithm which is equivalent to FFTW's split along the dimensions of the 2 × · · · × 2 real DFT. However, FFTW's performance is far from optimal, because it does not provide codelets for the WHT (the only available codelet is RDFT_2 = WHT_2) and does not vectorize it.

FIR filters and wavelets. We compare the performance of FIR filters and downsampled filters in Fig. 6.4. A pair of downsampled filters constitutes a wavelet transform. On both platforms only IPP implements FIR filters. It supports both regular and downsampled (called "multirate") FIR filters. Our generated library computes all filters by definition, i.e., by directly evaluating the matrix-vector product, without using the DFT. The corresponding breakdown rules are called "time-domain" filtering rules, and they perform various forms of blocking of the FIR filter matrix. We also use some FIR-filter-specific vectorization breakdown rules.

[Figure 6.4: four performance plots. Length-5000 FIR filter with downsampling 1 (first row) and 2 (second row), 2 thread(s); Gflop/s versus number of taps (10 to 60); platforms: Intel Xeon 5160 (3000 MHz) and AMD Opteron 2220 (2800 MHz); lines: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure 6.4: Generated libraries for FIR filters; the first row is a plain filter, the second row is a filter downsampled by 2, which is the building block for a 2-channel wavelet transform.

Our generated libraries are only efficient for a small number of filter taps (< 60), because they do not incorporate the "frequency-domain" breakdown rules, due to a current limitation of the generator explained in Section 6.1. Frequency-domain methods reduce the cost of the FIR filter by using a pair of DFTs, and are the best way to compute FIR filters with a larger number of taps.
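For reference, the frequency-domain approach rests on the convolution theorem. For cyclic convolution of a length-$n$ input $x$ with the filter taps $h$ (linear filtering additionally requires an overlap-add or overlap-save scheme), one computes

$$y = \mathrm{DFT}_n^{-1}\,\mathrm{diag}\!\left(\mathrm{DFT}_n\, h\right)\mathrm{DFT}_n\, x,$$

which costs $O(n \log n)$ operations instead of the $2nk$ of the direct method, and hence wins once the number of taps $k$ is large enough.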

Overall, the generated libraries are competitive with IPP on both platforms. On the Intel platform IPP is generally faster, and on the AMD platform the generated libraries are usually faster.

6.2.2 Non 2-power sizes

We decided to constrain the benchmarks to 2-power sizes for the transforms that use algorithms relying on factoring the size. This includes all transforms except the FIR filter. The limitation allows us to generate libraries with base cases for 2-power sizes only, and thus speeds up the generation.

The library generator, however, can produce libraries for any class of sizes for these transforms. To control the class of sizes supported by a library that uses a size-factoring algorithm, like the Cooley-Tukey FFT, one merely needs to include additional base cases.

As an example, we generated a library for the double precision DFT with base cases for n ∈ {2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 32, 64} using algorithm (6.1) from Table 6.1. This library supports DFT sizes of the form N = 2^i 3^j 5^k.

[Figure 6.5: performance plot. Complex DFT, double precision (2-way vectorized), 1 thread; Gflop/s versus length (10 to 10000); platform: Intel Xeon 5160 (3000 MHz); lines: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure 6.5: Generated DFT library for sizes N = 2^i 3^j 5^k. No threading is used.

Fig. 6.5 shows the performance of this generated library. The library is faster due to the minimized shuffle overhead of (6.1), as is the case for the 2-power sizes in Fig. 6.1 in Section 6.2.1.
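A minimal sketch of the size class this library supports (the function name is ours): a size n is valid exactly when it factors completely over {2, 3, 5}.

    bool isSupportedSize(int n) {    // n = 2^i * 3^j * 5^k ?
        if (n < 1) return false;
        const int primes[] = { 2, 3, 5 };
        for (int i = 0; i < 3; i++)
            while (n % primes[i] == 0)
                n /= primes[i];
        return n == 1;               // no other prime factors remain
    }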

6.2.3 Different Precisions

The library generator can be used to produce libraries for the same functionality using different precisions. For example, the platforms we benchmarked support double precision and single precision floating point calculations, as well as a variety of signed and unsigned integer number formats. In order to achieve the best performance, code cannot be reused between single and double precision libraries, because the SSE/SSE2 instruction sets support double precision calculations with 2-way vectors, and single precision calculations with wider, 4-way vectors.

In Fig. 6.6 we show the performance of non-threaded single and double precision libraries generated for the DFT and DCT-2. The single precision libraries achieve better performance than their double precision counterparts due to the wider 4-way vectorization; however, due to two current limitations of the generator, they are not as fast as they could be in principle.

First, the library generator does not vectorize the base cases (with the exception of the double precision DFT). Since the base cases are fixed-size transforms, adding support for vectorization mostly requires some integration effort. As a consequence, because non-vectorized single precision code is slower than non-vectorized double precision code, the single precision libraries are slower than the double precision ones for transform sizes ≤ 64, which use non-vectorized base cases.

Second, some SPL vectorization rules reported in [54] relevant to 4-way vectorization are not ported to Σ-SPL, and thus the 4-way vectorization is somewhat suboptimal.

As we previously saw in Section 6.2.1 (with 2 threads and double precision), for both single and double precision the generated DFT libraries have performance close to FFTW and IPP, and the generated DCT-2 libraries are much faster, due to the native DCT algorithm.

[Figure 6.6: four performance plots on the Intel Xeon 5160 (3000 MHz), 1 thread: DFT and DCT2, each in double precision (2-way vectorized) and single precision (4-way vectorized); Gflop/s versus length (4 to 64k); lines: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure 6.6: Comparison of generated libraries for the DFT and DCT-2 in double (2-way vectorized) and single (4-way vectorized) precision. No threading.

6.2.4 Scalar Code

For best performance the library should be vectorized and multithreaded. However, it is also of interest to compare scalar (non-vectorized and non-parallelized) performance. Generally, there is more than one way to vectorize and to parallelize, which is further complicated by various threading overheads; this makes direct performance comparisons difficult to interpret. If, in contrast, both vectorization and parallelization are disabled, performance differences will be due only to different algorithm choices, the general code structure, and the overall level of library-specific and transform-specific optimizations. We compare to FFTW only, because a scalar version of IPP is not available.

Consider the overall code structure. Both FFTW and the generated libraries decompose the original transform recursively and terminate the recursion with base cases ("codelets" in FFTW). For both libraries, the base cases are large blocks of unrolled code with at most one outer loop.

Consider the algorithms. The recursive DFT algorithms used by both libraries for the scalar code are the same (i.e., Cooley-Tukey), and the RDFT algorithms are very similar.

Now let’s compare the performance. Both the DFT and the RDFT performance is shown inFig. 6.7. The degree of similarity is striking. For sizes larger than 64, the performance of FFTWand of generated libraries for DFT and RDFT is practically identical. This led some people in ourgroup to coin the term “to generate FFTW.”

[Figure 6.7: two performance plots on the Intel Xeon 5160 (3000 MHz), 1 thread, non-vectorized, double precision: DFT and RDFT; Gflop/s versus length (4 to 64k); lines: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure 6.7: Generated scalar DFT and RDFT libraries versus scalar FFTW. No threading.

Both plots show a performance drop at transform size 128. The drop occurs at the start of the recursion: both libraries provide base cases for sizes up to 64, so size 128 has to go through at least one level of recursive splitting, which introduces extra overhead.

For sizes 64 and smaller, however, the generated library is noticeably faster, in particular in the case of the RDFT, even though both libraries use automatically generated, fully unrolled base case code. We believe that this happens because the generated library uses base cases for a plain DFT/RDFT recursion step without any gather/scatter (recursion step 1 in Fig. 3.4), while FFTW reuses the base cases for strided gather/scatter DFTs (recursion step 2 in Fig. 3.4), which degrades performance due to unnecessary index computation. The index computation penalty is higher for smaller transforms, which have fewer floating point operations.

These performance results mean that the library generator not only achieves the same overall optimization level as FFTW, it can even discover new kinds of optimizations. In this case, it inadvertently discovered an optimization that increases the performance of small transform sizes considerably.

From a practical standpoint this means that the library generator makes it possible to generate a library of the same quality as FFTW starting only from a high-level algorithm specification. It streamlines the generation process and makes it portable to other transforms and algorithms, without any human effort. In particular, similarly high-quality libraries can be obtained for other transforms, as we already saw with the DCTs.

6.3 Higher-Dimensional Transforms

All linear transforms that we consider extend to multiple dimensions. The multidimensional transform is in most cases simply a separable product of 1-D transforms performed along each dimension. Exceptions to this rule are the RDFT and DHT, which require some additional post-processing.

The separable product along each dimension is simply a tensor product of matrices. For example, given a 1-dimensional transform T^1, its 2-dimensional and 3-dimensional variants are

    T^2 = T^1 ⊗ T^1,  and  T^3 = T^1 ⊗ T^1 ⊗ T^1.

Standard tensor product properties [21, 122] can now be used to manipulate multidimensional transforms and obtain breakdown rules. For example, the following decompositions are readily obtained:

    T^2 = T^1 ⊗ T^1 = (T^1 ⊗ I)(I ⊗ T^1),   (6.3)
    T^3 = T^1 ⊗ T^1 ⊗ T^1 = (T^1 ⊗ I ⊗ I)(I ⊗ T^1 ⊗ I)(I ⊗ I ⊗ T^1).   (6.4)

The above rules are the standard way of implementing multidimensional transforms. However, they are just one possibility out of many. Tensor product identities can be used to obtain other algorithms. For example, if each T^1 is factored into a product of sparse matrices, T^1 = A_1 · · · A_k, we can obtain "vector-radix"-type multidimensional algorithms by systematically interleaving the factors, as shown below for 2 dimensions:

    T^1 ⊗ T^1 = (A_1 · · · A_k) ⊗ (A_1 · · · A_k) = (A_1 ⊗ A_1) · · · (A_k ⊗ A_k).   (6.5)

The A_i are sparse, and the above split leads to a different algorithm. In some cases, it can reduce the arithmetic cost. For example, if some of the A_i are diagonal matrices, say D, applying the identity D ⊗ D = D' (where D' is a new diagonal) saves operations compared to the standard algorithm (6.3).
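As an illustration, rule (6.3) is the familiar row-column method. A sketch for an n × n array in row-major order, assuming a routine transform1d that applies T^1 to n elements at a given stride (real data for simplicity):

    void transform1d(double *data, int n, int stride);   // assumed routine

    void transform2d(double *data, int n) {
        for (int r = 0; r < n; r++)        // (I ⊗ T^1): each contiguous row
            transform1d(data + r * n, n, 1);
        for (int c = 0; c < n; c++)        // (T^1 ⊗ I): each strided column
            transform1d(data + c, n, n);
    }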

We have generated 2-dimensional DFT and 2-dimensional DCT-2 libraries using (6.3). It is common practice to parallelize the outer loops over the 1-D transforms. However, an interesting property of the library generator is that it can also vectorize these outer loops, instead of vectorizing the loops inside the transform. This leads to less overhead associated with vector shuffles.

Fig. 6.8 shows the performance of the generated 2-dimensional DFT and DCT-2 libraries on the Intel Xeon 5160. As a common-case scenario the libraries use up to 2 threads, and we show both double and single precision. FFTW supports both variants, while IPP has 2-dimensional transforms only as part of its image processing domain and does not support double precision.

[Figure 6.8: four performance plots on the Intel Xeon 5160 (3000 MHz), 2 thread(s): 2-D DFT (MDDFT) and 2-D DCT-2 (MDDCT2), each in double precision (2-way vectorized) and single precision (4-way vectorized); Gflop/s versus image size (8x8 to 512x512); lines: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure 6.8: Generated 2-dimensional DFT and DCT-2 libraries, up to 2 threads. IPP does not provide double precision 2-dimensional transforms.

For the DFT, the generated libraries are noticeably faster for smaller sizes, where the 1-dimensional building blocks are implemented as base cases, and the reduced shuffle overhead improves the performance. Once one of the dimensions is larger than 64, the 1-dimensional transforms need to recurse further to reach the base cases, and the increased overhead reduces the performance. At larger sizes the entire data set no longer fits into the L2 cache, and at this point vectorizing the outer loop severely degrades the performance of the generated library. In addition, as we already discussed, some necessary optimizations to reduce cache thrashing are not implemented.

As with the 1-dimensional DCT-2 described in the previous section, the generated libraries for 2-dimensional DCTs are much faster than the competition. The main reason is, as before, our use of the general-radix Cooley-Tukey-type algorithm for the 1-dimensional DCT. In addition, the library automatically generates the appropriate DCT base cases (DCT codelets are missing in FFTW). Finally, the vectorization of the outer loop also improves the performance.

6.4 Efficiency of Vectorization and Parallelization

In this section, we briefly discuss the efficiency of our vectorization and parallelization.

Vectorization. We demonstrate the performance gain of vectorization in Fig. 6.9. The smaller transform sizes (≤ 64) are not vectorized in the generated libraries (except for the double precision DFT) due to a current limitation of the generator. As an artifact of this limitation, for small transform sizes it appears that 4-way vectorization leads to a slowdown. This is not true; the slowdown is due to the lack of vectorization, since scalar single precision code is slower than scalar double precision code, and the 4-way vectorized libraries are single precision libraries.

Overall, the vectorization payoff depends on the regularity of the algorithm. The best speedups are associated with the more regular algorithms, like the RDFT and DFT. The least regular algorithm is that for the DCT-2, and there the vectorization gain is lowest.

Parallelization. The parallelization speedup for the DFT and the real DFT (the other trigonometric transforms have similar behavior) is shown in Figures 6.10 and 6.11. On our main benchmark platform (Intel Xeon 5160, Fig. 6.10) we obtain good performance scaling when going from 1 to 2 threads, due to the L2 cache shared between a pair of cores, and no further speedup when going to 4 threads. The L2 cache is only shared between 2 cores, and the communication between the core pairs has to happen through the main memory bus. Thus, the ratio of CPU speed to memory bandwidth appears to be the bottleneck.

To give some reference points, we performed the same experiment on the new Intel platform based on the Core 2 Extreme QX9650 processor, code-named "Penryn". The processor provides better memory bandwidth, and the plot in Fig. 6.11 now shows a speedup with 4 threads over 2 threads.

Our automatic parallelization was also evaluated on earlier platforms in [53], showing a linear speedup with 4 processors.

[Figure 6.9: four performance plots on the Intel Xeon 5160 (3000 MHz), 1 thread: DFT, RDFT, DCT2, and DCT4; Gflop/s versus length (4 to 64k); lines: no vectorization, 2-way, 4-way.]

Figure 6.9: Vectorization efficiency. Platform: Intel Xeon.

[Figure 6.10: two performance plots on the Intel Xeon 5160 (3000 MHz): DFT and RDFT, 2-way vectorized, double precision; Gflop/s versus length (4 to 64k); lines: 1 thread, 2 threads, 4 threads.]

Figure 6.10: Parallelization efficiency. 4 threads give no speedup over 2 threads, due to low memory bandwidth. Platform: Intel Xeon 5160.

[Figure 6.11: two performance plots on the Intel Core 2 Extreme QX9650 (3000 MHz): DFT and RDFT, 2-way vectorized, double precision; Gflop/s versus length (4 to 64k); lines: 1 thread, 2 threads, 4 threads.]

Figure 6.11: Better parallelization efficiency due to better memory bandwidth. Platform: Intel Core 2 Extreme QX9650.

6.5 Flexibility

One of the big advantages of library generation is the flexibility it provides by enabling the generation of different custom libraries. This flexibility is not available in fixed hand-written libraries. We have already shown the versatility of the library generator with respect to the data type, novel transforms and novel algorithms, different types of vectorization and parallelization, and different transform size classes.

Here we survey the additional kinds of customization enabled by the library generator, discuss their benefits, and show several relevant examples.

We discuss three main types of customization: functional, qualitative, and backend customization. Functional customization is concerned with extending or reducing the functionality of the library. Qualitative customization is used to modify other, non-functional properties of the library. When a library in a different language is desired, backend customization is needed. In all three cases we give concrete examples of how our library generator can provide the given kind of customization.

6.5.1 Functional Library Customization

If some library functions do not exactly match the user requirements, there is often no other solution but to either reimplement the desired functions or, if possible, preprocess the input or postprocess the output of the library function. Reimplementing is often not practical and might not result in good performance, and pre- or postprocessing also leads to a performance penalty.

In the case of open-source libraries, like FFTW, it is possible to modify the library itself; however, this is not practical for an average user, due to the high software complexity.

The most common examples of functionality customization include the choice of the data type (e.g., double or single precision, fixed or floating point), the input or output data format (e.g., split or interleaved complex, different packings of the symmetric complex output of the real DFT, out-of-order data), and minor changes in the transform definition (e.g., scaling in DFTs and DCTs).

As an example of functional customization, we considered the well-known problem of scaling the outputs of the Fourier transforms (DFT and RDFT). The common definition of the forward and inverse DFT uses unscaled roots of unity, namely

    DFT_n = [ω_n^{ij}],  DFT_n^{-1} = [ω_n^{-ij}],  ω_n = e^{-2π√(-1)/n}.

A simple computation shows that applying the forward and inverse DFT in succession leads back to the original input scaled by n. For many applications this is acceptable, and these transforms are called unscaled. However, if the true inverse is desired, the output of either the forward or the inverse DFT must be scaled by 1/n, or both must be scaled by 1/√n.

[Figure 6.12: two performance plots on the Intel Xeon 5160 (3000 MHz), 1 thread: RDFT in double precision (2-way vectorized) and single precision (4-way vectorized); Gflop/s versus length (4 to 64k); lines: IPP 5.2 unscaled ("no div"), IPP 5.2 1/n, Generated unscaled, Generated 1/n.]

Figure 6.12: Generated RDFT libraries with built-in scaling. The generated libraries incur no performance penalty.

Consider, for example, the real DFT in Intel IPP. The library provides different scaling modes. In Fig. 6.12 we show the performance of the unscaled ("no div") and scaled ("1/n") IPP modes in both single and double precision. The scaling introduces an average performance penalty of 10% in the single precision case and 20% in the double precision case. Since scalar multiplication cannot possibly introduce a consistent 10% or 20% slowdown to an O(n log n) algorithm, the scaling is probably implemented as a separate pass over the data set, which introduces unnecessary memory traffic.

The generated libraries in Fig. 6.12 incur no scaling overhead. The scaling is incorporated directly into the recursion steps (and thus codelets) using the rewrite rules that merge diagonal matrices with adjacent loops.
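The difference is easy to see in a sketch (illustrative code, not the generated library): a separate scaling pass reads and writes the whole array once more, while a factor folded into constants that the last stage applies anyway adds no memory traffic.

    // Separate pass: one extra read and write of the entire output array.
    void scalePass(double *Y, int n, double s) {
        for (int i = 0; i < n; i++)
            Y[i] *= s;
    }

    // Fused: the factor s is folded into values the final stage writes anyway
    // (in the generated libraries, into the diagonal/twiddle constants of the
    // codelets), so the memory traffic is the same as in the unscaled case.
    void lastStageScaled(double *Y, const double *T, int n, double s) {
        for (int i = 0; i < n; i++)
            Y[i] = s * T[i];
    }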

6.5.2 Qualitative Library Customization

Common examples of qualitative customization include trading off smaller code size for slower runtime, optimizing the average performance versus the best possible performance (e.g., at specific transform sizes), trading off accuracy for runtime, etc.

Most qualitative customizations are not possible with fixed libraries without reimplementation.

As an example of qualitative customization, in Fig. 6.13 we show the code size versus performance tradeoff in generated libraries. The different lines in the plot are obtained by changing the number of generated base case sizes for each recursion step type.

[Figure 6.13: performance plot. Complex DFT, double precision, no vectorization; Gflop/s versus length (4 to 64k); lines for generated libraries of 1.1, 1.3, 1.9, 3.3, and 13 KLOC, and for FFTW at 155 KLOC.]

Figure 6.13: Performance vs. code size tradeoff in a generated DFT library. KLOC = kilo (thousands of) lines of code. The generated libraries are for the DFT and 2-power sizes only. FFTW includes other transforms and supports all transform sizes.

The code sizes of the generated libraries are rather suboptimal, so the absolute numbers from Fig. 6.13 are not directly comparable to optimized human-written code, and especially not to code optimized for size. The code size can be dramatically reduced by reducing the size of the recursion step closure. In this case, and in most other cases, this reduction comes without any performance cost, because many similar recursion steps can be unified into a single recursion step, similarly to how the scaling was incorporated without any additional performance penalty in Fig. 6.12. Reducing the size of the closure reduces the code size due to the smaller number of recursion steps and base cases.

6.5.3 Backend Customization: Example, Java

Decoupling the transform-specific and platform-specific parts of the library generator allows us to easily retarget the generator to produce code in programming languages other than C++. In this section we describe our latest work, which provided the Java backend. The backend was implemented in a mere 2 days of work by Frederic de Mesmay. The implications, however, are significant: this allows us to automatically generate high-performance Java libraries for linear transforms.

In the rest of this section we describe the motivation for implementing numeric libraries in Java and the alternative implementation approaches, and then benchmark our generated Java libraries.

Motivation. Recently, Java has emerged as one of the most popular programming languages for many applications. Java's slogan "Compile once, run everywhere" represents a paradigm shift from the traditional compiled programming languages. Java programs are compiled into portable byte-code, which is executed by a special interpreter called the Java Virtual Machine (JVM). This allows Java byte-code files to be executed on a variety of platforms without recompilation.

Another important aspect of Java is its extensive standardized APIs, offering functionality ranging from standard data structures to graphical user interfaces (GUIs) and network access. Coupled with the portability provided by execution on the JVM, Java provides a true virtual platform.

In addition, the JVM, as an additional layer of indirection, offers other important features. For example, the JVM provides a controlled environment, in which running applications can be restricted from full access to the host platform for security reasons. The JVM performs consistency checks, which prevent common errors such as out-of-bounds array accesses and buffer overflow attacks. These security and robustness guarantees, in addition to true portability, allow the JVM to safely execute remote code. Thus, Java enables true heterogeneous distributed computing.

As a result, Java as a development platform has become a popular choice in areas ranging from education to large-scale software development.

Interfaces to native libraries. Java programs can invoke native machine-specific libraries, such as FFTW and IPP, by using the Java Native Interface (JNI). In this case, special wrappers must be provided for the native libraries. Even though this approach will most likely lead to the best performance, it has several disadvantages, stemming from bypassing the virtualized JVM execution. Namely, the resulting Java application is no longer fully portable, it cannot provide the safety and robustness guarantees, and the resulting code is not "mobile", i.e., it cannot be executed by a remote machine that does not have the machine-specific library.
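For concreteness, a minimal sketch of what such a JNI wrapper might look like in C++; the Java class and method names here are hypothetical, not from any existing wrapper:

    #include <jni.h>

    // Hypothetical native implementation of a Java method declared as
    //   package spiral; class NativeDFT { public native void compute(double[] y, double[] x); }
    extern "C" JNIEXPORT void JNICALL
    Java_spiral_NativeDFT_compute(JNIEnv *env, jobject, jdoubleArray jy, jdoubleArray jx)
    {
        jdouble *Y = env->GetDoubleArrayElements(jy, 0);
        jdouble *X = env->GetDoubleArrayElements(jx, 0);
        // ... invoke the native library here, e.g. dft->compute(Y, X) ...
        env->ReleaseDoubleArrayElements(jx, X, JNI_ABORT);  // input unchanged
        env->ReleaseDoubleArrayElements(jy, Y, 0);          // copy results back
    }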

Java numeric libraries. To take full advantage of Java and of execution on the JVM, it makes sense to write numeric libraries in Java itself. In the context of numerical libraries and applications, Java has been studied in [7, 26, 28, 29, 129].

Unfortunately, less effort has been spent on Java numeric libraries; thus fewer optimized libraries exist, and a lot of important functionality is missing. In the domain of linear transforms, we found only two competing implementations: the open-source JTransforms [128] and the commercial JMSL [88]. JTransforms is a high-performance library for linear transforms, which provides Java implementations of 1-, 2-, and 3-dimensional real and complex DFT, DCT, and DST. JTransforms is based on the highly optimized C implementation by Takuya Ooura [92]. Wendykier in [129] describes the application of JTransforms to large-scale image deblurring. JMSL is a commercial Java numeric library sold by Visual Numerics; it has a wide range of functionality ranging from linear algebra to neural networks, and includes a 1-dimensional real and complex DFT. We did not benchmark JMSL due to lack of time.

We did not find any optimized Java libraries for FIR filters, wavelets, the WHT, or the DCT-4.

Benchmarks. We benchmarked four generated Java libraries: DFT, RDFT, DCT-2, and an FIR filter. All libraries were non-threaded and non-vectorized. Vectorization is not possible in Java, because it is a cross-platform language and thus does not provide hardware-specific SIMD instructions. Threading is supported, but our simple Java backend cannot yet produce threaded Java code.

Tests were done on the Intel Xeon 5160 machine using Sun Java JDK 1.6. We compared the DFT, RDFT, and DCT-2 against JTransforms, which was restricted to 1 thread for an apples-to-apples comparison. The FIR filter is compared to a naive double-loop implementation. The results are shown in Fig. 6.14.

The generated DFT, RDFT, and DCT-2 libraries outperform JTransforms, except for the larger DFT sizes, where JTransforms is up to 10% faster. As before, the generated libraries do not yet incorporate any optimization for transforms that fall out of the L2 cache, and thus we only show sizes that fit into the L2 cache. The slowdown at larger DFT sizes seems to come from Java timer instability, which causes the runtime adaptation mechanism in the generated libraries to find suboptimal results (slower than the simplest model we have tried).

For FIR filters we did not find any competing high-performance Java library, and thus we benchmark against a naive double-loop Java implementation. The generated library is between 2 and 5.5 times faster than the naive FIR code; a sketch of this baseline is shown below.
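To make this baseline concrete, the following is a minimal sketch of the naive double-loop FIR computation. It is written in C++ to match the rest of this thesis; the benchmarked baseline was the equivalent Java code, and the function name and signature here are illustrative only.

    // Naive double-loop FIR baseline (illustrative; names are hypothetical).
    #include <cstddef>
    #include <vector>

    // Computes y[i] = sum_j h[j] * x[i + j] for i = 0, ..., n-1;
    // x must hold at least n + h.size() - 1 samples.
    void fir_naive(const std::vector<double>& x, const std::vector<double>& h,
                   std::vector<double>& y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {           // loop over outputs
            double acc = 0.0;
            for (std::size_t j = 0; j < h.size(); ++j)  // loop over taps
                acc += h[j] * x[i + j];
            y[i] = acc;
        }
    }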


[Figure 6.14: Generated Java libraries performance. Four panels: DFT, RDFT, and DCT2 (1 thread, non-vectorized, double precision; series: JTransforms, Generated library) and an 8-tap FIR filter with downsampling = 1 (series: Naive, Generated). Each panel plots performance (Gflop/s) vs. length on an Intel Xeon 5160, 3000 MHz.]

While the overhead of a JVM interpreter is very high, today most JVMs employ JIT (just-in-time) compilers, which translate byte-code into native machine code at runtime. JIT compilation results in near-native performance of Java applications; however, some overhead is still unavoidable because of the additional runtime consistency and safety checks imposed by the JVM. In the case of numeric code, the most important such check is the bounds check on array indices.

As a result of these checks, the generated DFT, RDFT, and DCT-2 libraries perform about 1.5 to 2 times slower than the corresponding native (non-vectorized) C++ code. However, the FIR filter performs almost as well as the C++ implementation. We conjecture that most of the runtime array index bounds checks can be eliminated by the JIT compiler, since the FIR filter code has simple index expressions.

JTransforms complexity. JTransforms employs an iterative split-radix FFT algorithm with many specialized functions that implement different parts of the algorithm. Most of the functions represent a DFT, possibly with a loop around it and some additional related computations, e.g., a permutation or a scaling by twiddle factors. These functions are equivalent to our recursion steps, and the full set of them to the recursion step closure; however, in this case the closure was derived manually. To compare the complexity of the implementations, we show the recursion step count for JTransforms in Table 6.4, which should be compared with Table 6.2.

The library has more recursion steps but less code overall than the generated libraries, because base cases are provided for size-4 transforms only.


                      number of recursion steps     total code size
    Transform         scalar    parallelized

    DFT + RDFT        25        28                  3.0 KLOC / 0.08 Mb
    DCT-2 + DCT-3     25        28                  2.8 KLOC / 0.08 Mb

Table 6.4: Number of recursion steps in the JTransforms Java library.

6.5.4 Other Kinds of Customization

There are numerous other kinds of customization that are possible with the generator. In particular, it is possible to generate adaptive and non-adaptive libraries. More specifically, the standard libraries that we benchmarked all include three runtime phases that the library goes through in order to compute the result:

1. Planning phase. In this phase, the values of the degrees of freedom within the breakdown rules that lead to the best performance are chosen.

2. Initialization phase. Here, the necessary data (e.g., the twiddle factors for the Cooley-Tukey FFT) is precomputed, and, if temporary memory buffers are needed, they are allocated.

3. Compute phase. The actual transform is computed.

It might be useful to generate, besides the regular three-phase library, also two-phase and even single-phase libraries. A two-phase library would exclude the planning phase and is thus non-adaptive. A single-phase library, in addition to being non-adaptive, does not have the precomputation stage. A sketch of the three-phase structure is shown below.
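The following minimal C++ sketch illustrates the three-phase structure; all names are hypothetical and the method bodies are placeholders, not the actual generated interface.

    // Hypothetical three-phase structure of a generated transform library.
    #include <complex>
    #include <vector>

    class dft_plan {
    public:
        explicit dft_plan(int n) : n_(n), radix_(2) {}

        // Phase 1 -- planning: choose the values of the degrees of freedom
        // (here: a radix), typically by timing candidate recursions.
        void plan() { radix_ = 4; /* placeholder for a timed search */ }

        // Phase 2 -- initialization: precompute data (e.g., twiddle factors)
        // and allocate temporary buffers.
        void initialize() { twiddles_.assign(n_, {1.0, 0.0}); }

        // Phase 3 -- compute: the actual transform, callable repeatedly.
        void compute(std::complex<double>* y, const std::complex<double>* x) {
            for (int i = 0; i < n_; ++i) y[i] = x[i];   // placeholder body
        }

    private:
        int n_, radix_;
        std::vector<std::complex<double>> twiddles_;
    };

A two-phase library would omit plan(), and a single-phase library would additionally fold initialize() into compute().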

Another possible kind of customization is the generation of a specific library interface. This would be achieved by modifying the backend, and thus can also be called backend customization. For the DFT, for instance, there are three popular interfaces: DFTI [119], VSIPL [69, 78], and FFTW [61].

6.6 Detailed Performance Evaluation

In Appendix A we provide performance plots with exhaustive benchmarks for the following cross product:

    {no vectorization, 2-way vectorization, 4-way vectorization}
  × {1 thread, 2 threads, 4 threads}
  × {DFT, DHT, RDFT, DCT-2, DCT-3, DCT-4, FIR filter, downsampled FIR filter}

on three platforms: Intel Xeon 5160 (Section A.1), AMD Opteron 2220 (Section A.2), and Intel Core 2 Extreme QX9650 (Section A.3). We did not benchmark non-vectorized FIR filter libraries, and we did not benchmark any FIR filter libraries on the Core 2 Extreme QX9650.


No vectorization corresponds to double precision, 2-way vectorization corresponds to double precision using 2-way SSE2 intrinsics, and 4-way vectorization corresponds to single precision using 4-way SSE and SSE2 intrinsics.


Chapter 7

Future Work and Conclusions

We restate the goal from the introduction:

The goal of this thesis is the computer generation of high performance libraries for the entire domain of linear transforms. The input is a high level algorithm specification, request for multithreading, and vector length. The output is an optimized library with runtime platform adaptation mechanism.

In this thesis we described a prototype that achieves the above goal. This goal, and the automation of high-performance library development in general, is a problem at the core of computer science. We have shown that for an entire domain of structurally complex algorithms, highest-performance libraries can be generated automatically from the algorithm specification (e.g., the breakdown rules in Table 2.3). The key is a properly designed domain-specific language (∑-SPL in our case) that makes it possible to perform all difficult tasks (extraction of recursion steps, vectorization, etc.) at a high level of abstraction using rewriting and other techniques.

The generated libraries in all cases have performance comparable to hand-written code, and in many cases they outperform the hardware vendor libraries, which are implemented by experts who understand the platform very well.

We have based our work on Spiral, the code generator for linear transforms, and as part of this thesis extended Spiral's capabilities in three key areas: 1) we enabled generating efficient loop code by introducing ∑-SPL and loop merging; 2) we enabled generating general-size transform code by automatically analyzing the breakdown rules and computing the recursion step closure; 3) we provided a general vectorization and parallelization framework to deal with the increasing on-chip parallelism of modern computer platforms.

Clearly, a lot can be done to improve the existing functionality. Looking into the future, two main directions can be identified: extending the range of supported hardware platforms and computing paradigms, and extending the supported functionality beyond transforms. In fact, first steps in both of these directions have already been made by other members of the Spiral group.

Given the limitations of clock frequency scaling and of conventional vector and thread parallelism, it appears unavoidable that new computing paradigms will be adopted in the future. For example, today we see the emergence of hardware accelerator technologies, like on-chip graphics processing units (GPUs) being used for general numeric computing and on-board field-programmable gate arrays (FPGAs). The high-level, domain-specific, and declarative nature of SPL and ∑-SPL provides a convenient basis for capturing new paradigms. For example, [83, 101] uses Spiral, SPL breakdown rules, and rewriting to generate efficient DFT implementations on FPGAs; [41] uses Spiral to perform automatic software-hardware partitioning; [30] describes the application of Spiral to generate code for distributed-memory workstation clusters using MPI; [36] uses Spiral to generate code for the Cell processor; and [58] shows some preliminary results on using Spiral to generate GPU-accelerated implementations.

To capture a larger domain of numeric functionality, [57] proposes the operator language, which extends SPL and ∑-SPL to non-transform, non-linear functionality. If successful, this work may enable automation in other numerical domains beyond linear transforms that are relevant for signal processing, communications, or scientific computing, such as, for instance, linear algebra.

Next we provide a more detailed overview of the contributions of this thesis, current limitations, and future work.

7.1 Major Contributions

We summarize the major contributions of this thesis below.

• Looped code generation and ∑-SPL. The loop merging framework described in Section 2.4 enables generating efficient looped code for a wide variety of transforms. The formalism enables the elimination of redundant loops for well-known algorithms with well-understood data reordering, like the Cooley-Tukey FFT (strides), and for lesser-known algorithms, like the general-radix fast DCT, the Rader FFT, and the prime-factor FFT. This part is joint work with Franz Franchetti, and is also reported in [52].

• General size code generation via the recursion step closure. Our approach to general size code generation is described in Chapter 3. The key to generating a general-size library is the ability to statically (i.e., at code generation time) compute the finite recursion step closure. This enables the generation of full transform libraries similar to FFTW, rather than collections of a few fixed transform sizes.

• Parallelization and vectorization framework. The parallelism framework is described in Chapter 4. It enables the generation of code that exploits the two kinds of parallelism present in modern computers, namely SIMD vector parallelism and MIMD thread parallelism. Usually, high-performance numeric software relies on manual programming to exploit this available parallelism, instead of relying on a general-purpose compiler. We automate this time-consuming process, and we have shown that the generated libraries have performance competitive with hand-written code. This part builds on work with Franz Franchetti, and is also reported in [53, 54].

• Full automation. We demonstrate a system prototype that achieves full automation in library implementation. This is facilitated by the library plan construction (described in Chapter 5), which combines multiple implementations of each recursion step and enables the generation of a single object (e.g., a C++ class) for each recursion step. Further, the library plan enables "whole-library" analysis, which we use to automatically partition the parameters into hot (available at compute time) and cold (available at initialization time). Such analyses are not possible if the generator is not aware of the full set of recursion step implementations that form the library; a sketch of the resulting hot/cold interface is shown after this list.
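The following minimal C++ sketch (hypothetical names, placeholder body) illustrates the resulting shape of a generated recursion step: cold parameters are bound once when the object is constructed, while hot parameters are passed on every call.

    // Hypothetical recursion step with hot/cold parameter partitioning.
    #include <complex>
    #include <vector>

    class dft_step {
    public:
        // Cold parameters: size and stride, fixed at initialization time;
        // the twiddle factors are precomputed here.
        dft_step(int n, int stride)
            : n_(n), stride_(stride), tw_(n, {1.0, 0.0}) {}

        // Hot parameters: buffer pointers and the loop-dependent offset,
        // known only at compute time.
        void compute(std::complex<double>* y, const std::complex<double>* x,
                     int offset) const {
            for (int i = 0; i < n_; ++i)                // placeholder body
                y[offset + i * stride_] = tw_[i] * x[offset + i * stride_];
        }

    private:
        int n_, stride_;
        std::vector<std::complex<double>> tw_;
    };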


Finally, we want to note that generators are by nature much more flexible than hand-written code. The flexibility in the choice of transform/algorithm is demonstrated through a number of different experiments in Chapter 6. The flexibility in customizing one specific type of functionality is discussed in Section 6.5.

7.2 Current Limitations

Our current prototype implementation has a number of limitations, all of which should be regarded as short-term "future work".

• SPL vectorization rules are not completely ported to ∑-SPL; some breakdown rules are still missing. This results in suboptimal performance, in particular with single-precision (4-way vectorized) libraries.

• Base cases of the libraries are not vectorized (with the exception of the double precision DFT). This results in suboptimal performance for small sizes in vectorized libraries; the problem is especially visible in single-precision (4-way vectorized) libraries.

• The runtime search over alternative breakdown rules is not implemented; instead, the library uses the first applicable breakdown rule. However, the library can search over the degrees of freedom within one breakdown rule (e.g., the radix in the Cooley-Tukey FFT (2.1)). Again, this may result in suboptimal performance in some cases. It also precludes using a different algorithm for transforms whose data set is not L2 cache resident, and results in severe performance degradation in this case. However, a working prototype of this kind of search was recently implemented by Frederic de Mesmay, and it promises to fix these problems.

• For large (non-L2-cache-resident) DFT-related transforms, standard algorithms are not always suitable due to large 2-power strides. We could not, however, provide alternative algorithms, since the libraries did not provide full adaptation, as mentioned in the previous bullet, and thus could not choose the better algorithm.

• Transforms cannot be invoked from within the initialization of other transforms. This is needed to implement frequency-domain FIR filtering algorithms, the Rader and Bluestein FFTs, and other algorithms. This limitation is due to the fact that precomputations requiring transforms are hard-coded in Spiral to be performed at code generation time. To solve this problem, the affected breakdown rules in Spiral must be updated.

• Constraint solving in parametrization (Section 3.3.3) requires at least a full linear system solver. We have not incorporated a full solver, since in most cases the systems are close to trivial. In the most general case, the system of equations may not even be linear. However, the non-linearity we have encountered was restricted to equations of the form u1 u2 = u3, in which case we could simply treat the product u1 u2 as one variable, since in those cases there were no separate constraints on u1 and u2 (a small worked instance is shown after this list). Finally, we do not capture all of the constraints in the parametrization; for example, some ∑-SPL expressions that correspond to square matrices become matrices of size u1 × u2, with no known guarantee that u1 = u2. We have not encountered a problem related to this, and it does not happen if GTI (Section 3.8) is used.
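As a small worked instance of the substitution mentioned above (hypothetical values, for illustration only):

    % One constraint set of the form encountered in practice:
    %   u1 * u2 = u3  and  u3 = n, with n known at generation time.
    \[
      u_1 u_2 = u_3, \quad u_3 = n
      \qquad\xrightarrow{\;v \,:=\, u_1 u_2\;}\qquad
      v = u_3, \quad u_3 = n,
    \]
    % which is linear in (v, u3); since u1 and u2 occur in no other
    % constraint, v never needs to be split back into its factors.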


7.3 Future Work

In the longer term, we see three main research directions. The first direction is implementing new optimizations, increasing the coverage of existing optimizations, and implementing other extensions within the current domain of linear transforms. The second direction is extending our results to new domains, for example, using the operator language [57]. The third direction is aligning the capabilities of the generator with the needs of industry. Broadly speaking, this means increasing the robustness of the generator and of the generated code.

New optimizations and other extensions. A number of important optimizations can be implemented within the generator to improve the generated code. We list some possibilities:

• Integrate offline and online search. By offline and online we mean the search that occurs at code generation time and at library initialization time, respectively. Currently, we have only a very restricted integration between these two adaptation mechanisms: the transform base cases are platform-adapted using offline search, and the best recursion is determined using online search. However, much more is possible. For example, multiple alternative breakdown rules can be compiled into a single adaptive library. It might be the case, though, that one breakdown rule consistently outperforms the others; if this can be determined offline, the library can be compiled with only that breakdown rule, reducing the overall code size.

• Improve storage management. The library generator needs more robust support of in-place transforms, and more efficient management of temporary storage.

• Use automatically constructed performance models instead of runtime platform adaptation. Originally we provided a hand-tuned model instead of search to select the best values for the degrees of freedom in the library. The hand-tuned model was always very successful for a given library (within 10% of search). However, changing the platform, data type, vectorization, number of threads, transform, or breakdown rules always led to the need to adjust the model. Given the success of models, it would be very useful to construct such models automatically using machine learning techniques, in the spirit of work by Singer and others [109]. This could be considered a form of offline search.

• Implement software pipelining. All Intel and AMD processor based platforms that we benchmarked use out-of-order processors with a very small number of registers. On these platforms software pipelining is not beneficial. However, on platforms with in-order processors, it may make a noticeable performance difference.

• Improve scheduling for straightline code. Clearly, software pipelining is tied to code scheduling. However, scheduling is also useful without software pipelining. For example, even on out-of-order processors, scheduling can improve performance by grouping distant independent instructions in order to overcome the size limitation of the reorder buffer. Other factors may also contribute to performance. For example, Intel processors seem to favor blocks of loads followed by blocks of stores, while on other platforms code sequences with interleaved loads and stores may perform faster (see the sketch after this list). Generally speaking, scheduling is the task of the C++ compiler; however, the compiler often does not schedule optimally, and the order of statements in the generated code can affect the final outcome. In a generator, one could try different scheduling strategies on a given platform. One example of scheduling statements in straightline C code is described in [59].
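The following C++ sketch shows the scheduling alternative just mentioned: the same four copies written with interleaved loads and stores versus grouped blocks; which variant is faster depends on the platform.

    // Two schedules for the same straightline copy of four doubles.
    void copy4_interleaved(double* y, const double* x) {
        y[0] = x[0];   // load/store pairs alternate
        y[1] = x[1];
        y[2] = x[2];
        y[3] = x[3];
    }

    void copy4_blocked(double* y, const double* x) {
        // Block of loads first...
        double t0 = x[0], t1 = x[1], t2 = x[2], t3 = x[3];
        // ...followed by a block of stores.
        y[0] = t0; y[1] = t1; y[2] = t2; y[3] = t3;
    }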


Other domains. Some of the work we have done as part of this thesis is not specific to linear transforms. The essence of our approach is specially designed domain-specific languages (SPL, ∑-SPL, index-free ∑-SPL, the intermediate code representation), multi-layered rewriting, and online and offline search mechanisms that use alternative breakdown rules and/or the degrees of freedom within a breakdown rule. This approach can be applied to other domains for which a suitable domain-specific language exists. One example is described in [57], which proposes an extension to SPL and ∑-SPL that can potentially capture several other numeric domains.

Needs of the industry. As we learned from [11], being able to generate fast numeric code is not yet sufficient to enable inclusion into a commercial numeric library, like Intel IPP or MKL.

We give just a small sample of issues that must be addressed:

• SIMD vectorized code must support both aligned and unaligned input and output vectors. (This is usually handled by specialization; a sketch is shown after this list.)

• Code must adhere to preexisting library conventions, e.g., concerning memory allocation, threading, etc.

• Besides supporting a multitude of data types, which Spiral does rather well, hybrid functions might be needed, where the input data type (say, 16-bit integer) is different from the output data type (say, 32-bit floating point).

• Due to the large amount of generated code (potentially thousands of functions), the code must be directly pluggable into the existing library without any further modifications. In particular, function and file naming must conform to some standard naming scheme, processor dispatch must be handled, makefiles must be automatically generated, etc.

• The generated code must adhere to a certain predefined interface, for example, DFTI [119] in the case of DFTs, or the internal library interface when a lower-level primitive is generated.
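As an illustration of the first item, the following C++ sketch shows specialization for aligned versus unaligned data using SSE intrinsics; the function names are hypothetical, and a real library would select the variant via dispatch.

    // Aligned vs. unaligned specializations of a 4-way single precision scale.
    #include <xmmintrin.h>

    // Assumes x and y are 16-byte aligned: aligned loads/stores.
    void scale4_aligned(float* y, const float* x, float s) {
        _mm_store_ps(y, _mm_mul_ps(_mm_load_ps(x), _mm_set1_ps(s)));
    }

    // Works for arbitrary pointers: unaligned loads/stores, typically slower.
    void scale4_unaligned(float* y, const float* x, float s) {
        _mm_storeu_ps(y, _mm_mul_ps(_mm_loadu_ps(x), _mm_set1_ps(s)));
    }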

None of the individual issues above presents a serious research challenge on its own. In combination, however, they require a systematic approach for an efficient, extensible, and scalable solution. Our separation of the library generator into Library Structure and Library Implementation modules (Fig. 3.1 in Chapter 3) is a first step in this direction.


Appendix A

Detailed Performance Evaluation

A.1 Intel Xeon 5160


[Figure A.1: Generated libraries performance: trigonometric transforms, no vectorization (double precision), no threading. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.2: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), no threading. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.3: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), no threading. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.4: Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 2 threads. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.5: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 2 threads. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.6: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 2 threads. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.7: Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 4 threads. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.8: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 4 threads. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.9: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 4 threads. Platform: Intel Xeon 5160, 3000 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, FFTW 3.2alpha2, Generated library.]


[Figure A.10: Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 1 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: lengths 100, 1000, 5000, 50000; performance (Gflop/s) vs. taps (10 to 60); series: IPP 5.2 single/double, Generated single/double.]

[Figure A.11: Generated libraries performance: FIR filter, downsampling = 1, varying length, up to 1 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: 12-tap and 32-tap filters; performance (Gflop/s) vs. length (20 to 81920); series: IPP 5.2 single/double, Generated single/double.]


[Figure A.12: Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 2 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: lengths 100, 1000, 5000, 50000; performance (Gflop/s) vs. taps (10 to 60); series: IPP 5.2 single/double, Generated single/double.]

[Figure A.13: Generated libraries performance: FIR filter, downsampling = 1, varying length, up to 2 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: 12-tap and 32-tap filters; performance (Gflop/s) vs. length (20 to 81920); series: IPP 5.2 single/double, Generated single/double.]


[Figure A.14: Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 4 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: lengths 100, 1000, 5000, 50000; performance (Gflop/s) vs. taps (10 to 60); series: IPP 5.2 single/double, Generated single/double.]

[Figure A.15: Generated libraries performance: FIR filter, downsampling = 1, varying length, up to 4 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: 12-tap and 32-tap filters; performance (Gflop/s) vs. length (20 to 81920); series: IPP 5.2 single/double, Generated single/double.]


[Figure A.16: Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 1 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: lengths 100, 1000, 5000, 50000; performance (Gflop/s) vs. taps (10 to 60); series: IPP 5.2 single/double, Generated single/double.]

[Figure A.17: Generated libraries performance: FIR filter, downsampling = 2, varying length, up to 1 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: 12-tap and 32-tap filters; performance (Gflop/s) vs. length (20 to 81920); series: IPP 5.2 single/double, Generated single/double.]


[Figure A.18: Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 2 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: lengths 100, 1000, 5000, 50000; performance (Gflop/s) vs. taps (10 to 60); series: IPP 5.2 single/double, Generated single/double.]

[Figure A.19: Generated libraries performance: FIR filter, downsampling = 2, varying length, up to 2 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: 12-tap and 32-tap filters; performance (Gflop/s) vs. length (20 to 81920); series: IPP 5.2 single/double, Generated single/double.]


[Figure A.20: Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 4 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: lengths 100, 1000, 5000, 50000; performance (Gflop/s) vs. taps (10 to 60); series: IPP 5.2 single/double, Generated single/double.]

[Figure A.21: Generated libraries performance: FIR filter, downsampling = 2, varying length, up to 4 thread(s). Platform: Intel Xeon 5160, 3000 MHz. Panels: 12-tap and 32-tap filters; performance (Gflop/s) vs. length (20 to 81920); series: IPP 5.2 single/double, Generated single/double.]


A.2 AMD Opteron 2220


[Figure A.22: Generated libraries performance: trigonometric transforms, no vectorization (double precision), no threading. Platform: AMD Opteron 2220, 2800 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]


[Figure A.23: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), no threading. Platform: AMD Opteron 2220, 2800 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]


[Figure A.24: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), no threading. Platform: AMD Opteron 2220, 2800 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]


[Figure A.25: Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 2 threads. Platform: AMD Opteron 2220, 2800 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]


[Figure A.26: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 2 threads. Platform: AMD Opteron 2220, 2800 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]


[Figure A.27: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 2 threads. Platform: AMD Opteron 2220, 2800 MHz. Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; performance (Gflop/s) vs. length (4 to 64k); series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]

Figure A.28: Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 4 threads. Platform: AMD Opteron 2220.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]

Figure A.29: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 4 threads. Platform: AMD Opteron 2220.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, APL 1.1, FFTW 3.2alpha2, Generated library.]

Figure A.30: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 4 threads. Platform: AMD Opteron 2220.


[Panels: Length-100, Length-1000, Length-5000, and Length-50000 FIR filter, downsampling=1, 1 thread(s); x-axis: Taps (10–60); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.31: Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 1 thread(s). Platform: AMD Opteron 2220.

[Panels: 12-tap and 32-tap FIR filter, downsampling=1, 1 thread(s); x-axis: Length (20–81920); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.32: Generated libraries performance: FIR filter, varying length, up to 1 thread(s). Platform: AMD Opteron 2220.


[Panels: Length-100, Length-1000, Length-5000, and Length-50000 FIR filter, downsampling=1, 2 thread(s); x-axis: Taps (10–60); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.33: Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 2 thread(s). Platform: AMD Opteron 2220.

[Panels: 12-tap and 32-tap FIR filter, downsampling=1, 2 thread(s); x-axis: Length (20–81920); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.34: Generated libraries performance: FIR filter, varying length, up to 2 thread(s). Platform: AMD Opteron 2220.


[Panels: Length-100, Length-1000, Length-5000, and Length-50000 FIR filter, downsampling=1, 4 thread(s); x-axis: Taps (10–60); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.35: Generated libraries performance: FIR filter, downsampling = 1, varying number of taps, up to 4 thread(s). Platform: AMD Opteron 2220.

[Panels: 12-tap and 32-tap FIR filter, downsampling=1, 4 thread(s); x-axis: Length (20–81920); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.36: Generated libraries performance: FIR filter, varying length, up to 4 thread(s). Platform: AMD Opteron 2220.


[Panels: Length-100, Length-1000, Length-5000, and Length-50000 FIR filter, downsampling=2, 1 thread(s); x-axis: Taps (10–60); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.37: Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 1 thread(s). Platform: AMD Opteron 2220.

[Panels: 12-tap and 32-tap FIR filter, downsampling=2, 1 thread(s); x-axis: Length (20–81920); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.38: Generated libraries performance: FIR filter, varying length, up to 1 thread(s). Platform: AMD Opteron 2220.


[Panels: Length-100, Length-1000, Length-5000, and Length-50000 FIR filter, downsampling=2, 2 thread(s); x-axis: Taps (10–60); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.39: Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 2 thread(s). Platform: AMD Opteron 2220.

[Panels: 12-tap and 32-tap FIR filter, downsampling=2, 2 thread(s); x-axis: Length (20–81920); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.40: Generated libraries performance: FIR filter, varying length, up to 2 thread(s). Platform: AMD Opteron 2220.


[Panels: Length-100, Length-1000, Length-5000, and Length-50000 FIR filter, downsampling=2, 4 thread(s); x-axis: Taps (10–60); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.41: Generated libraries performance: FIR filter, downsampling = 2, varying number of taps, up to 4 thread(s). Platform: AMD Opteron 2220.

[Panels: 12-tap and 32-tap FIR filter, downsampling=2, 4 thread(s); x-axis: Length (20–81920); y-axis: Performance [Gflop/s]; series: IPP 5.2 single, IPP 5.2 double, Generated single, Generated double.]

Figure A.42: Generated libraries performance: FIR filter, varying length, up to 4 thread(s). Platform: AMD Opteron 2220.


A.3 Intel Core 2 Extreme QX9650


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.43: Generated libraries performance: trigonometric transforms, no vectorization (double precision), no threading. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.44: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), no threading. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.45: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), no threading. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.46: Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 2 threads. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.47: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 2 threads. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.48: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 2 threads. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.49: Generated libraries performance: trigonometric transforms, no vectorization (double precision), up to 4 threads. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.50: Generated libraries performance: trigonometric transforms, 2-way vectorization (double precision), up to 4 threads. Platform: Intel Core 2 Extreme QX9650.


[Panels: DFT, RDFT, DHT, DCT2, DCT3, DCT4; x-axis: Length (4–64k); y-axis: Performance [Gflop/s]; series: IPP 5.2, FFTW 3.2alpha2, Generated library.]

Figure A.51: Generated libraries performance: trigonometric transforms, 4-way vectorization (single precision), up to 4 threads. Platform: Intel Core 2 Extreme QX9650.
