FFTs, Portability, & Performance
Steven G. Johnson, MIT Dept. Physics
Matteo Frigo, ITA Software (formerly MIT LCS)
A Need for Speed?
Scientists (along with gamers) often push performance limits.
Low-level programming?
Codes have long lifetimes, and need flexibility & portability.
High-level programming?
Perhaps there is a better way?
FFTW
• C library for real & complex FFTs (arbitrary size/dimensionality)
• Computational kernels (80% of code) automatically generated
• Self-optimizes for your hardware: portability + performance
(+ parallel versions for threads & MPI)
free software: http://www.fftw.org/
The “Fastest Fourier Transform in the West”?
no code is always fastest, but…
FFTW on 167MHz UltraSPARC
double precision, complex 1d transforms
But this is OLD!
okay, I’ll present some new stuff…
FFTW 3.0 (soon to be released)
FFTW on 2GHz Pentium IV [two plots comparing FFTW 3 and FFTW 2]
FFTW on 1GHz Alpha (EV7) [plot comparing FFTW 3 and FFTW 2]
Why is FFTW fast?
FFTW implements many FFT algorithms:
1. A planner picks the best composition by measuring the speed of different combinations.
2. The resulting plan is executed with explicit recursion: enhances locality.
3. The base cases of the recursion are codelets: highly-optimized dense code that is automatically generated by a special-purpose “compiler”.
FFTW is easy to use
{
    complex x[n];
    plan p;

    p = plan_dft_1d(n, x, x, FORWARD, MEASURE);
    ...
    execute(p); /* repeat as needed */
    ...
    destroy_plan(p);
}
Key fact: usually, many transforms of same size are required.
Outline
FFT algorithm basics
Recursion and caches
The planner
The codelet generator
Cooley-Tukey FFTs [1965 … or Gauss, 1802]
1d DFT of size n = pq:
= ~2d DFT of size p x q
first DFT columns, size q (non-contiguous)
multiply by n “twiddle factors”
transpose
finally, DFT columns, size p (non-contiguous)
[Diagram: the p x q array and its q x p transpose, with a legend marking which direction is contiguous]
Cooley-Tukey FFTs [1965 … or Gauss, 1802]
1d DFT of size n = pq:
= ~2d DFT of size p x q
= Recursive DFTs of sizes p and q
O(n^2) → O(n log n) (divide-and-conquer algorithm)
Cooley-Tukey FFTs [1965 … or Gauss, 1802]
[Diagram labels: size-q DFTs, twiddles, size-p DFTs]
Cooley-Tukey is Naturally Recursive
But traditional implementation is non-recursive, breadth-first traversal:
log_2 n passes over whole array
Size 8 DFT
Size 4 DFT Size 4 DFT
Size 2 DFT Size 2 DFT Size 2 DFT Size 2 DFT
p = 2 (radix 2)
Recursive Divide & Conquer is Good
Size 8 DFT
Size 4 DFT Size 4 DFT
Size 2 DFT Size 2 DFT Size 2 DFT Size 2 DFT
p = 2 (radix 2)
eventually small enough to fit in cache… no matter what size the cache is
Out-of-cache FFTs: “Blocking”
Cache-oblivious Recursive FFT
[ Vitter & Shriver, 1994 ]
Cache Obliviousness
• A cache-oblivious algorithm does not know the cache size
— it can be optimal for any machine & for all levels of cache simultaneously
• They exist for matrix multiplication, LU decomposition, sorting, transposition, binary search trees, etc. [Frigo et al. 1999]
— all via the recursive divide & conquer approach
FFTW uses a finite-radix (p) recursive cache-oblivious algorithm with suboptimal “cache complexity” O(n log[n/C]),
…but an optimal algorithm is used in the generator (cache == registers)
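Transposition is one of the problems listed above as having a cache-oblivious algorithm. The following is a minimal sketch of the recursive divide & conquer idea, not code taken from FFTW or from [Frigo et al. 1999]. The function names and the base-case size of 4 are arbitrary choices for illustration; the cutoff only limits recursion overhead and is not a cache parameter.

```c
#include <assert.h>
#include <stddef.h>

enum { BASE = 4 };  /* illustrative cutoff, unrelated to any cache size */

/* Transpose the block rows [r0,r1) x columns [c0,c1) of the row-major
   rows x cols matrix a into the row-major cols x rows matrix b. */
static void transpose_rec(const double *a, double *b,
                          size_t r0, size_t r1, size_t c0, size_t c1,
                          size_t rows, size_t cols)
{
    size_t nr = r1 - r0, nc = c1 - c0;
    if (nr <= BASE && nc <= BASE) {
        for (size_t r = r0; r < r1; ++r)
            for (size_t c = c0; c < c1; ++c)
                b[c * rows + r] = a[r * cols + c];
    } else if (nr >= nc) {
        /* Split the longer dimension in half and recurse on both halves. */
        size_t mid = r0 + nr / 2;
        transpose_rec(a, b, r0, mid, c0, c1, rows, cols);
        transpose_rec(a, b, mid, r1, c0, c1, rows, cols);
    } else {
        size_t mid = c0 + nc / 2;
        transpose_rec(a, b, r0, r1, c0, mid, rows, cols);
        transpose_rec(a, b, r0, r1, mid, c1, rows, cols);
    }
}

/* Out-of-place transpose: b[c][r] = a[r][c]. */
void transpose(const double *a, double *b, size_t rows, size_t cols)
{
    assert(a != b);
    transpose_rec(a, b, 0, rows, 0, cols, rows, cols);
}
```

The code never mentions a cache size: the sub-blocks simply keep halving, so at some depth a block fits in whatever cache level the machine has.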
The Planner
• There are many choices in implementing the C-T algorithm
— which factor p? & memory access ordering…
Each algorithm step is represented by a solver.
• The planner tries the different solver combinations for a given n, measures their speed, and picks the fastest.
— uses dynamic programming
— can use heuristics or saved plans if planning time is a concern
Vectors and Solvers
a problem is specified as a DFT(v,n): multi-dimensional transform n of multi-dimensional vectors v
SOLVE[v,n]: Directly solve size n with 1d vector (loop) v by an efficient codelet (hard-coded FFT loop)
CT-FACTOR[p]: DFT(v, n = pq) = DFT(v x p, q) + v size-p DFTs with twiddles [loop v of hard-coded twiddle codelet p]
VECLOOP: DFT(v x m, n) = loop m of DFT(v,n)
each solver knows what problems it can solve and tells the planner its recursive “child” problems
Dynamic Programming
the assumption of “optimal substructure”
Try all applicable solvers:
DFT(16) = fastest of:
    CT-FACTOR[2]: 2 DFT(8)
    CT-FACTOR[4]: 4 DFT(4)
DFT(8) = fastest of:
    CT-FACTOR[2]: 2 DFT(4)
    CT-FACTOR[4]: 4 DFT(2)
    SOLVE[1,8]
(assume VECLOOP strips off loops)
If exactly the same problem appears twice, assume that we can re-use the plan.
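The dynamic-programming idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the real planner times candidate plans on the hardware, whereas here a hypothetical additive cost model (`demo_codelet_cost`, `demo_twiddle_cost`) stands in for the stopwatch so that the result is deterministic. The `planner` struct, all function names, the restriction to power-of-two sizes, and the cost formulas are invented for this example.

```c
#include <assert.h>
#include <float.h>

#define MAX_LOG2 20

typedef struct {
    int radix;    /* 0 = SOLVE (direct codelet), else CT-FACTOR[radix] */
    double cost;  /* cost of the best plan found; negative = not planned */
} plan_entry;

typedef struct {
    plan_entry best[MAX_LOG2 + 1];  /* memo table indexed by log2(n) */
    int max_codelet;                /* largest size with a hard-coded codelet */
    double (*codelet_cost)(int n);  /* stand-in for timing SOLVE on size n */
    double (*twiddle_cost)(int p);  /* stand-in for timing one size-p twiddle DFT */
} planner;

static int ilog2(int n)
{
    int l = 0;
    while (n > 1) { n >>= 1; ++l; }
    return l;
}

/* Hypothetical cost model, chosen only to make the example concrete. */
double demo_codelet_cost(int n) { return (double)(n * (ilog2(n) + 1)); }
double demo_twiddle_cost(int p) { return (double)(p * ilog2(p) + 4); }

void planner_init(planner *pl, int max_codelet,
                  double (*codelet_cost)(int), double (*twiddle_cost)(int))
{
    assert(max_codelet >= 2);
    for (int i = 0; i <= MAX_LOG2; ++i)
        pl->best[i].cost = -1.0;
    pl->max_codelet = max_codelet;
    pl->codelet_cost = codelet_cost;
    pl->twiddle_cost = twiddle_cost;
}

/* Cost of the best plan for DFT(n), n a power of two. Tries every
   applicable solver and memoizes the winner per problem size. */
double plan_dft(planner *pl, int n)
{
    assert(n >= 1 && (n & (n - 1)) == 0 && n <= (1 << MAX_LOG2));
    plan_entry *e = &pl->best[ilog2(n)];
    if (e->cost >= 0.0)
        return e->cost;            /* same problem seen before: re-use */

    double best = DBL_MAX;
    int best_radix = -1;
    if (n <= pl->max_codelet) {    /* SOLVE is applicable */
        best = pl->codelet_cost(n);
        best_radix = 0;
    }
    for (int p = 2; p < n && p <= pl->max_codelet; p *= 2) {
        /* CT-FACTOR[p]: p sub-DFTs of size q plus q size-p twiddle DFTs. */
        int q = n / p;
        double c = p * plan_dft(pl, q) + q * pl->twiddle_cost(p);
        if (c < best) { best = c; best_radix = p; }
    }
    e->cost = best;
    e->radix = best_radix;
    return best;
}
```

With the demo cost functions and `max_codelet = 8`, `plan_dft(&pl, 16)` settles on CT-FACTOR[8] at a cost of 88, while DFT(8) is solved directly. The memo table is where the "optimal substructure" assumption lives: the best plan for a sub-problem is found once and assumed to remain best wherever that sub-problem reappears.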
More Solvers (out of ~16 total)
VECLOOP:
(a) DFT(v x m, n) = loop m of DFT(v,n)
OR
(b) DFT(m x v, n) = loop m of DFT(v,n)
i.e. interchange loop orders!
INDIRECT: DFT(v,n) = DFT(v,{}) + DFT(v,n)
zero-dimensional DFT = copy loop; in-place
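The two VECLOOP variants differ only in which loop dimension is peeled off first. A small sketch, assuming a two-dimensional vector and using an in-place size-2 DFT (a butterfly on real data) as a stand-in for the inner transform; `loop_dim`, `dft2`, `vecloop_a` and `vecloop_b` are hypothetical names, not FFTW identifiers:

```c
#include <assert.h>
#include <stddef.h>

/* One loop dimension of a "vector": n iterations, `stride` elements apart. */
typedef struct { size_t n; ptrdiff_t stride; } loop_dim;

/* Stand-in for the innermost transform: size-2 DFT on x[0] and x[s]. */
static void dft2(double *x, ptrdiff_t s)
{
    double a = x[0], b = x[s];
    x[0] = a + b;
    x[s] = a - b;
}

/* Variant (a): the m loop is outermost, the v loop runs inside it. */
void vecloop_a(double *x, loop_dim v, loop_dim m, ptrdiff_t s)
{
    assert(x != NULL);
    for (size_t j = 0; j < m.n; ++j)
        for (size_t i = 0; i < v.n; ++i)
            dft2(x + (ptrdiff_t)j * m.stride + (ptrdiff_t)i * v.stride, s);
}

/* Variant (b): same work with the two loops interchanged. */
void vecloop_b(double *x, loop_dim v, loop_dim m, ptrdiff_t s)
{
    assert(x != NULL);
    for (size_t i = 0; i < v.n; ++i)
        for (size_t j = 0; j < m.n; ++j)
            dft2(x + (ptrdiff_t)j * m.stride + (ptrdiff_t)i * v.stride, s);
}
```

Both variants produce identical results; what changes is the memory access order, which is why the choice between them is left to measurement.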
Actual Plan for size 2^19 = 524288
(2GHz Pentium IV, double precision, out-of-place)
CT-FACTOR[4] (buffered variant)
CT-FACTOR[32] (buffered variant)
VECLOOP(b) x32
CT-FACTOR[64]
INDIRECT
VECLOOP(a) x4
SOLVE[64, 64]
VECLOOP(b) x64
VECLOOP(a) x4
COPY[64]
~2000 lines hard-coded C!
INDIRECT + VECLOOP(b) (+ …) = demolishes FFTW 2 for large 1d sizes
Unpredictable: (automated) experimentation is the only solution.
The Codelet Generator
a domain-specific FFT “compiler”
• Generates fast hard-coded C for FFTs of arbitrary size
Necessary to give the planner a large space of codelets to experiment with.
Exploits modern CPU deep pipelines & large register sets.
Allows easy experimentation with different optimizations & algorithms.
…and you only have to get it right once.
The Codelet Generator
written in Objective Caml [Leroy, 1998], an ML dialect
n + Abstract FFT algorithm (Cooley-Tukey: n=pq, Prime-Factor: gcd(p,q) = 1, Rader: n prime, …)
→ Symbolic graph (dag)
→ Simplifications
→ Cache-oblivious scheduling (cache .EQ. registers)
→ Optimized C code (or other language)
powerful enough to e.g. derive real-input FFT from complex FFT algorithm and even find new algorithms
The Generator Finds Good/New FFTs
Symbolic Algorithms are Easy
Cooley-Tukey in OCaml
Simple Simplifications
Well-known optimizations:
Algebraic simplification, e.g. a + 0 = a
Constant folding
Common-subexpression elimination
Symbolic Pattern Matching in OCaml
The following actual code fragment is solely responsible for simplifying multiplications:

stimesM = function
  | (Uminus a, b) -> stimesM (a, b) >>= suminusM
  | (a, Uminus b) -> stimesM (a, b) >>= suminusM
  | (Num a, Num b) -> snumM (Number.mul a b)
  | (Num a, Times (Num b, c)) ->
      snumM (Number.mul a b) >>= fun x -> stimesM (x, c)
  | (Num a, b) when Number.is_zero a -> snumM Number.zero
  | (Num a, b) when Number.is_one a -> makeNode b
  | (Num a, b) when Number.is_mone a -> suminusM b
  | (a, b) when is_known_constant b && not (is_known_constant a) ->
      stimesM (b, a)
  | (a, b) -> makeNode (Times (a, b))

(Common-subexpression elimination is implicit via “memoization” and monadic programming style.)
Simple Simplifications
Well-known optimizations:
Algebraic simplification, e.g. a + 0 = a
Constant folding
Common-subexpression elimination
FFT-specific optimizations:
_________________ negative constants…
Network transposition (transpose + simplify + transpose)
A Quiz: Is One Faster?
Both compute the same thing, and have the same number of arithmetic operations:

a = 0.5 * b;
c = 0.5 * d;
e = 1.0 + a;
f = 1.0 - c;

a = 0.5 * b;
c = -0.5 * d;
e = 1.0 + a;
f = 1.0 + c;

The first is faster because there is no separate load for -0.5: 10–15% speedup
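For reference, the two quiz variants can be written as small C functions to confirm that they really do compute the same values. The function names are invented for this sketch, and it says nothing about timing: the speedup quoted on the slide is the authors' measurement.

```c
#include <assert.h>

/* Only the constant 0.5 is needed; the sign moves into the subtraction. */
void quiz_positive_constants(double b, double d, double *e, double *f)
{
    assert(e && f);
    double a = 0.5 * b;
    double c = 0.5 * d;
    *e = 1.0 + a;
    *f = 1.0 - c;
}

/* Needs two distinct constants, 0.5 and -0.5, hence an extra load. */
void quiz_negative_constant(double b, double d, double *e, double *f)
{
    assert(e && f);
    double a = 0.5 * b;
    double c = -0.5 * d;
    *e = 1.0 + a;
    *f = 1.0 + c;
}
```

Both versions perform two multiplications and two additions or subtractions, so an operation count alone cannot tell them apart.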
Non-obvious transformations require experimentation
Quiz 2: Which is Faster?
accessing strided array inside codelet (amid dense numeric code)
array[stride * i]
This is faster, of course! Except on brain-dead architectures…
array[strides[i]]
using precomputed stride array: strides[i] = stride * i
…namely, Intel Pentia: integer multiplication conflicts with floating-point
up to ~20% speedup
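The two access patterns from the quiz look like this in C. This is a sketch with invented function names that sums strided elements instead of running a real codelet; it shows only that the two indexing schemes touch the same elements, not which one is faster on a given CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Index computed on the fly: one integer multiplication per access. */
double sum_with_multiply(const double *array, size_t stride, size_t n)
{
    assert(array != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += array[stride * i];
    return s;
}

/* Index read from a precomputed table: one extra load per access. */
double sum_with_table(const double *array, const size_t *strides, size_t n)
{
    assert(array != NULL && strides != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += array[strides[i]];
    return s;
}

/* Build the table strides[i] = stride * i; the caller frees it. */
size_t *make_stride_table(size_t stride, size_t n)
{
    size_t *strides = malloc(sizeof *strides * (n ? n : 1));
    if (!strides)
        abort();
    for (size_t i = 0; i < n; ++i)
        strides[i] = stride * i;
    return strides;
}
```

The table costs memory and a load per access, which is why it normally loses; the slide's point is that only measurement reveals the machines where it wins.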
SIMD: The Revenge of the Crays
SIMD = Single Instruction, Multiple Data
Available on most popular processors today:
Pentium III+ SSE: operate on 4 float values
PowerPC G4 AltiVec: operate on 4 float values
AMD Athlon 3dNow!: operate on 2 float values
Pentium IV SSE2: operate on 2 double values
Modify only the generator to produce SIMD codelets
[ initiated by S. Kral and F. Franchetti, Univ. Vienna ]
SSE2 FFTW on 2GHz Pentium IV [plot: SSE2 FFTW 3, FFTW 3, Intel MKL]
SSE FFTW on 2GHz Pentium IV [plot: SSE FFTW 3, FFTW 3, Intel MKL]
with a generator, it’s easy to include less-popular cases…
SSE2 FFTW on 2GHz Pentium IV [plot: SSE2 FFTW 3, FFTW 3]
We’ve Come a Long Way
1965 Cooley & Tukey, IBM 7094, 36-bit single precision: size 2048 DFT in 1.2 seconds
2003 FFTW3+SIMD, 2GHz Pentium-IV 64-bit double precision: size 2048 DFT in 50 microseconds (24,000x speedup)
(= 30% improvement per year)
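The two figures follow from the quoted timings, assuming the improvement compounds evenly over the 38 years between 1965 and 2003:

$$
\frac{1.2\ \text{s}}{50\ \mu\text{s}} = \frac{1.2}{50 \times 10^{-6}} = 24{,}000,
\qquad
24{,}000^{1/38} = e^{\ln(24000)/38} \approx e^{0.265} \approx 1.30
$$

That is, a factor of roughly 1.30 per year, or about 30% annually.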
We’ve Come a Long Way?
• In the name of performance, computers have become complex and unpredictable.
• Optimization is hard: you cannot simply minimize the number of operations.
• The solution is to avoid the details, not embrace them:
(Recursive) composition of simple modules + feedback (self-optimization)
High-level languages (not C) & code generation are a powerful tool for high performance.
FFTW Homework Problems?
• Try an FFTPACK-style back-and-forth solver
• Implement Vector-Radix for multi-dimensional n
• Pruned FFTs: VECLOOP that skips zeros
• Better heuristic planner—some sort of optimization of per-solver “costs?”
• Modify generator for fixed-point arithmetic—e.g. faster integer MDCT for Ogg Vorbis audio
• Implement convolution solvers