HPEC 2008-1 SMHS 9/24/2008
MIT Lincoln Laboratory
Large Multicore FFTs: Approaches to Optimization
Sharon Sacco and James Geraci
24 September 2008
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government
Outline
• Introduction
  – 1D Fourier Transform
  – Mapping 1D FFTs onto Cell
  – 1D as 2D Traditional Approach
• Technical Challenges
• Design
• Performance
• Summary
1D Fourier Transform
g_j = Σ_{k=0}^{N−1} f_k e^(−2πijk/N)
• This is a simple equation
• A few people spend a lot of their careers trying to make it run fast
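The sum above can be evaluated directly; a minimal sketch (the function name `dft` is illustrative, not from the talk) showing the O(N²) baseline that an FFT improves to O(N log N):

```python
import cmath

def dft(f):
    """Direct evaluation of g_j = sum_{k=0}^{N-1} f_k e^(-2*pi*i*j*k/N)."""
    n = len(f)
    return [sum(f[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]
```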
Mapping 1D FFT onto Cell
[Cell block diagram, repeated for each FFT class: the eight SPEs and the PPE on the Element Interconnect Bus (EIB), with MIC to XDR memory and BEI to FlexIO; the FFT data resides in one local store, several local stores, or XDR memory]

• Cell FFTs can be classified by memory requirements
  – Small FFTs can fit into a single LS memory; 4096 points is the largest size
  – Medium FFTs can fit into multiple LS memories; 65536 points is the largest size
  – Large FFTs must use XDR memory as well as LS memory
• Medium and large FFTs require careful memory transfers
1D as 2D Traditional Approach
[Diagram: a 16-point 1D FFT arranged as a 4 × 4 matrix, with the central twiddle matrix w^(jk):
 w0 w0 w0 w0 / w0 w1 w2 w3 / w0 w2 w4 w6 / w0 w3 w6 w9]

1. Corner turn to compact columns
2. FFT on columns
3. Corner turn to original orientation
4. Multiply (elementwise) by central twiddles
5. FFT on rows
6. Corner turn to correct data order

• 1D as 2D FFT reorganizes the data a lot
  – Timing jumps when used
• Can reduce memory for twiddle tables
• Only one FFT needed
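The six steps above can be sketched in NumPy; a hedged illustration assuming a square R × R decomposition (the function name and the use of `np.fft` for the row/column transforms are mine, not the talk's Cell implementation):

```python
import numpy as np

def fft_1d_as_2d(x):
    """1D FFT of length R*R computed by the 1D-as-2D method above."""
    n = len(x)
    r = int(round(np.sqrt(n)))
    assert r * r == n, "sketch assumes a square R x R decomposition"
    a = np.asarray(x, dtype=complex).reshape(r, r)   # step 1: view data as R x R
    a = np.fft.fft(a, axis=0)                        # steps 2-3: FFT on columns
    j, k = np.meshgrid(np.arange(r), np.arange(r), indexing="ij")
    a = a * np.exp(-2j * np.pi * j * k / n)          # step 4: central twiddles w^(jk)
    a = np.fft.fft(a, axis=1)                        # step 5: FFT on rows
    return a.T.reshape(n)                            # step 6: corner turn to final order
```

The result matches a direct 1D FFT of the same input, which is what makes the decomposition attractive: only one small FFT size is ever needed.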
Outline
• Introduction
• Technical Challenges
  – Communications
  – Memory
  – Cell Rounding
• Design
• Performance
• Summary
Communications
[Cell block diagram: bandwidth to XDR memory is 25.3 GB/s; each SPE's connection to the EIB is 50 GB/s; EIB bandwidth is 96 bytes/cycle]

• Minimizing XDR memory accesses is critical
• Leverage the EIB
• Coordinating SPE communication is desirable
  – Need to know SPE relative geometry
Memory
[Cell block diagram: XDR memory is much larger than the 1M-point FFT requires; each SPE has 256 KB of local store; each Cell has 2 MB of local store in total]

• Need to rethink algorithms to leverage the memory
  – Consider local store from both the individual and the collective SPE point of view
Cell Rounding
• The cost of correcting the rounding of the basic binary operations (add, multiply, subtract) is prohibitive
• Accuracy should instead be improved by minimizing the number of steps the algorithm takes to produce a result

[Diagram: IEEE 754 round-to-nearest versus Cell truncation of the final bit; round-to-nearest gives an average value of x01 + 0 bits, truncation an average value of x01 + .5 bit]
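The half-bit bias of truncation can be demonstrated numerically; a small sketch (helper names are illustrative) simulating chopping versus round-to-nearest at 8 fractional bits:

```python
import random

ULP_BITS = 8
SCALE = 1 << ULP_BITS
ulp = 1.0 / SCALE

def truncate(x):
    """Chop to ULP_BITS fractional bits, as the Cell SPE single-precision unit does."""
    return int(x * SCALE) / SCALE          # toward zero for x >= 0

def round_nearest(x):
    """IEEE 754 round-to-nearest at the same precision."""
    return round(x * SCALE) / SCALE

random.seed(0)
xs = [random.random() for _ in range(100_000)]
trunc_bias = sum(truncate(x) - x for x in xs) / len(xs)
round_bias = sum(round_nearest(x) - x for x in xs) / len(xs)
# truncation drags the average result down by about half an ulp; rounding is unbiased
print(trunc_bias / ulp)   # ≈ -0.5
print(round_bias / ulp)   # ≈ 0.0
```

This systematic bias, rather than the size of any single rounding error, is why minimizing the operation count matters so much on Cell.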
Outline
• Introduction
• Technical Challenges
• Design
  – Using Memory Well
    • Reducing Memory Accesses
    • Distributing on SPEs
    • Bit Reversal
    • Complex Format
  – Computational Considerations
• Performance
• Summary
FFT Signal Flow Diagram and Terminology
[Signal flow diagram: a 16-point decimation-in-frequency FFT; inputs 0–15 in natural order, outputs in bit-reversed order; a butterfly, a block, and a radix-2 stage are labeled]

• Size 16 can illustrate the concepts for large FFTs
  – The ideas scale well and it is "drawable"
• This is the "decimation in frequency" data flow
• Where the weights are applied determines the algorithm
Reducing Memory Accesses
• Columns will be loaded in strips that fit in the total Cell local store
• The FFT algorithm processes 4 columns at a time to leverage the SIMD registers
  – Requires separate code from the row FFTs
• Data reorganization requires SPE-to-SPE DMAs
• No bit reversal

[Diagram: a 1024 × 1024 data matrix; columns are loaded in strips of 64 and processed 4 at a time]
1D FFT Distribution with Single Reorganization
[Signal flow diagram: the 16-point FFT with the first stages computed on single SPEs, a single reorganization, and the remaining stages completed within each SPE's block]
• One approach is to load everything onto a single SPE to do the first part of the computation
• After a single reorganization each SPE owns an entire block and can complete the computations on its points
1D FFT Distribution with Multiple Reorganizations
[Signal flow diagram: the 16-point FFT with a reorganization after each of the early stages]

• A second approach is to divide groups of contiguous butterflies among SPEs and reorganize after each stage until the SPEs own a full block
Selecting the Preferred Reorganization
| Number of SPEs | Single Reorganization: Exchanges | Single Reorganization: Data in 1 DMA | Multiple Reorganizations: Exchanges | Multiple Reorganizations: Data in 1 DMA |
|---|---|---|---|---|
| 2 | 2 | N / 4 | 2 | N / 4 |
| 4 | 12 | N / 16 | 8 | N / 8 |
| 8 | 56 | N / 64 | 24 | N / 16 |
• Evaluation favors multiple reorganizations
  – Fewer DMAs have less bus contention; the single reorganization exceeds the number of busses
  – DMA overhead (~0.3 µs) is minimized
  – Programming is simpler for multiple reorganizations

N = number of elements in SPE memory, P = number of SPEs

• Single reorganization
  – Number of exchanges: P · (P − 1)
  – Number of elements exchanged: N · (P − 1) / P
• Multiple reorganizations
  – Number of exchanges: P · log₂(P)
  – Number of elements exchanged: (N / 2) · log₂(P)

Typical N is 32k complex elements
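The formulas above can be checked in a few lines; a sketch reproducing the table's exchange counts and per-DMA transfer sizes (function names are mine):

```python
import math

def single_reorg(n, p):
    """Exchange count and data per DMA for one all-to-all reorganization."""
    exchanges = p * (p - 1)
    per_dma = n * (p - 1) // p // exchanges   # works out to n / p**2
    return exchanges, per_dma

def multiple_reorg(n, p):
    """Exchange count and data per DMA for log2(p) pairwise reorganization stages."""
    stages = int(math.log2(p))
    exchanges = p * stages
    per_dma = (n // 2) * stages // exchanges  # works out to n / (2 * p)
    return exchanges, per_dma

n = 32 * 1024  # typical: 32k complex elements per SPE
for p in (2, 4, 8):
    print(p, single_reorg(n, p), multiple_reorg(n, p))
```

For 8 SPEs this gives 56 exchanges of N/64 elements versus 24 exchanges of N/16, matching the table and the preference for multiple reorganizations.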
Column Bit Reversal
• Bit reversal of columns can be implemented by the order of processing rows and double buffering
• Reversal row pairs are both read into local store and then written to each other's memory location

[Diagram: binary row numbers, e.g. row 000000001 exchanges with row 100000000]

• Exchanging rows for bit reversal has a low cost
• DMA addresses are table driven
• The bit reversal table can be very small
• Row FFTs are conventional 1D FFTs
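A bit-reversal table like the one driving the DMA addresses can be built as follows; a sketch, not the talk's implementation:

```python
def bit_reverse_table(bits):
    """Table mapping each `bits`-bit row number to its bit-reversed partner."""
    n = 1 << bits
    table = [0] * n
    for i in range(n):
        r, x = 0, i
        for _ in range(bits):
            r = (r << 1) | (x & 1)   # shift the low bit of x into r
            x >>= 1
        table[i] = r
    return table

# Row pairs (i, table[i]) with i < table[i] are exchanged via double-buffered DMAs;
# rows with i == table[i] stay in place.
pairs = [(i, r) for i, r in enumerate(bit_reverse_table(10)) if i < r]
```

For 10-bit row numbers the table has only 1024 entries, which is why it can be very small relative to the data.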
Complex Format
• Interleaved complex format reduces the number of DMAs
• Two common formats for complex data:
  – interleaved: real 0, imag 0, real 1, imag 1, …
  – split: real 0, real 1, … | imag 0, imag 1, …
• The complex format for the user should be standard
• Internal format conversion is lightweight
  – The internal format is opaque to the user
• The internal format should benefit the algorithm
• SIMD units need split format for complex arithmetic
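The interleaved-to-split conversion described above is a simple de-interleave; a minimal sketch (names illustrative):

```python
def interleaved_to_split(buf):
    """Convert [re0, im0, re1, im1, ...] to ([re0, re1, ...], [im0, im1, ...])."""
    return buf[0::2], buf[1::2]

def split_to_interleaved(re, im):
    """Inverse conversion, re-interleaving the two planes."""
    out = [0.0] * (2 * len(re))
    out[0::2] = re
    out[1::2] = im
    return out
```

On the SPEs this would be done with SIMD shuffle instructions rather than strided copies, but the data movement is the same.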
Outline
• Introduction
• Technical Challenges
• Design
  – Using Memory Well
  – Computational Considerations
    • Central Twiddles
    • Algorithm Choice
• Performance
• Summary
Central Twiddles
• Central twiddles can take as much memory as the input data
• Reading them from memory could increase FFT time by up to 20%
• For 32-bit FFTs, central twiddles can be computed as needed
  – Trigonometric identity methods require double precision; the next-generation Cell should make this the method of choice
  – Direct sine and cosine algorithms are long

[Diagram: Central Twiddles for 1M FFT — a 1024 × 1024 matrix of powers of w, from w^0 up to w^(1023 · 1023)]

• Central twiddles are a significant part of the design
Algorithm Choice
• Cooley-Tukey has a constant operation count
  – 2 muladds to compute each result at each stage
• Gentleman-Sande varies widely in operation count
  – 1–3 operations for each result
• The DC term has the same accuracy in both
• The Gentleman-Sande worst term has 50% more roundoff error when fused multiply-add is available

Computational butterflies:
• Cooley-Tukey: a′ = a + b · wᵏ, b′ = a − b · wᵏ
• Gentleman-Sande: a′ = a + b, b′ = (a − b) · wᵏ
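The two butterflies, and a radix-2 decimation-in-frequency FFT built from the Gentleman-Sande form, can be sketched as follows (output is left in bit-reversed order, as in the signal-flow diagram; function names are mine):

```python
import cmath

def cooley_tukey_butterfly(a, b, w):
    """DIT butterfly: the twiddle is applied before the add/subtract."""
    return a + w * b, a - w * b

def gentleman_sande_butterfly(a, b, w):
    """DIF butterfly: the twiddle multiplies the difference after the subtract."""
    return a + b, (a - b) * w

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT from Gentleman-Sande butterflies.
    Input in natural order; output in bit-reversed order."""
    x = list(x)
    n = len(x)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for i in range(span):
                w = cmath.exp(-2j * cmath.pi * i / (2 * span))
                x[start + i], x[start + i + span] = gentleman_sande_butterfly(
                    x[start + i], x[start + i + span], w)
        span //= 2
    return x
```

On Cell, each line of either butterfly maps to fused multiply-add instructions, which is what drives the accuracy comparison above.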
Radix Choice
• What counts for accuracy is how many operations separate the input from a particular result

Cooley-Tukey radix 4:
    t1 = xl · w^z
    t2 = xk · w^(z/2)
    t3 = xm · w^(3z/2)
    s0 = xj + t1
    a0 = xj − t1
    s1 = t2 + t3
    a1 = t2 − t3
    yj = s0 + s1
    yk = s0 − s1
    yl = a0 − i · a1
    ym = a0 + i · a1

• Number of operations for 1 radix-4 stage (real or imaginary): 9 (3 mul, 3 muladd, 3 add)
• Number of operations for 2 radix-2 stages (real or imaginary): 6 (6 muladd)
• Higher radices reuse computations but do not reduce the amount of arithmetic needed
• Fused multiply-add instructions are more accurate than a multiply followed by an add
• Radix 2 will give the best accuracy
Outline
• Introduction
• Technical Challenges
• Design
• Performance
• Summary
Estimating Performance
• Timing estimates typically cannot include all factors
  – Computation is based on the minimum number of instructions
  – I/O timings are based on bus speed and amount of data
  – Experience is a guide for the efficiency of I/O and computations

I/O estimate:
• Number of byte transfers to/from XDR memory: 33560192 (minimum)
• Bus speed to XDR: 25.3 GB/s
• Estimated efficiency: 80%
• Minimum I/O time: 1.7 ms

Computation estimate:
• Number of operations: 104875600
• Maximum GFLOPS: 205
• Estimated efficiency: 85%
• Minimum computation time: 0.6 ms (8 SPEs)

1M FFT estimate (without full ordering): 2 ms
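The estimates above follow from bytes / (bandwidth × efficiency) and operations / (peak rate × efficiency); a quick check (the 205 figure is taken as GFLOPS across 8 SPEs, an assumption consistent with the per-SPE peak):

```python
bytes_moved = 33_560_192      # minimum bytes to/from XDR for the 1M FFT
xdr_bw = 25.3e9               # XDR bandwidth, bytes per second
io_eff = 0.80                 # estimated I/O efficiency
ops = 104_875_600             # floating-point operations
peak = 205e9                  # peak FLOPS across 8 SPEs (assumed)
compute_eff = 0.85            # estimated compute efficiency

io_time = bytes_moved / (xdr_bw * io_eff)        # ~1.7 ms
compute_time = ops / (peak * compute_eff)        # ~0.6 ms
print(io_time, compute_time)
```

The I/O term dominates, which is consistent with the earlier emphasis on minimizing XDR accesses.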
Preliminary Timing Results
• Timing results are close to predictions
  – 4 SPEs: about a factor of 4 from prediction
  – 8 and 16 SPEs: closer to prediction

[Plot: 1M FFT timings — time in seconds (0 to 0.009) versus number of SPEs (0 to 18)]
• Timings were performed on Mercury CTES
– QS21 @3.2 GHz Dual Cell Blades
Summary
• A good FFT design must consider the hardware features
  – Optimize memory accesses
  – Understand how different algorithms map to the hardware
• The design needs to be flexible in its approach
  – "One size fits all" isn't always the best choice
  – Size will be a factor
• Estimates of minimum time should be based on the hardware characteristics
• A 1M-point FFT is difficult to write, but possible