FFT Convolutions are Faster than Winograd on Modern CPUs, Here's Why

Aleksandar Zlateski* (Massachusetts Institute of Technology)
Zhen Jia* (Princeton University)
Kai Li (Princeton University)
Fredo Durand (Massachusetts Institute of Technology)

* Both authors contributed equally to the paper.

Abstract

Winograd–based convolution has quickly gained traction as a preferred approach to implement convolutional neural networks (ConvNets) on various hardware platforms because it requires fewer floating point operations than FFT–based or direct convolutions. This paper compares three highly optimized implementations (Regular–FFT–, Gauss–FFT–, and Winograd–based convolutions) on modern multi– and many–core CPUs. Although all three implementations employ the same optimizations for modern CPUs, our experimental results with two popular ConvNets (VGG and AlexNet) show that the FFT–based implementations generally outperform the Winograd–based approach, contrary to the popular belief. To understand the results, we use the Roofline performance model to analyze the three implementations in detail, by looking at each of their computation phases and by considering not only the number of floating point operations, but also the memory bandwidth and the cache sizes. The performance analysis explains why, and under what conditions, the FFT–based implementations outperform the Winograd–based one on modern CPUs.

1 Introduction

Convolutional neural networks (ConvNets) have emerged as a widespread machine learning method for many application domains, soon after the demonstration of their superior performance for classification and localization tasks in the ImageNet [12] competition in 2012 [20]. Since convolutional layers are computationally intensive and dominate the total execution time of modern deep ConvNets [20, 25, 28, 29], many efforts have been made to improve the performance of convolutional primitives for CPUs [1, 10, 30, 36, 38], GPUs [5, 11, 24, 31], or both [37].

An important class of optimization is to reduce the number of calculations required for a convolution. Initially, several efforts used FFT–based convolutions to reduce the required computations on GPUs [24, 31] and CPUs [36, 37]. Recently, Lavin et al. [21] proposed a Winograd–based method for performing convolutions and demonstrated that it can save more floating point operations than FFT, especially for small 2D kernels (e.g. 3 × 3). This prompted a large number of implementations to employ Winograd–based convolutions. For example, Nervana [5] and Nvidia's cuDNN [11] implemented Winograd–based convolution for GPUs. FALCON [1], LIBXSMM [6] and Intel MKL-DNN [2] provided CPU implementations of Winograd–based convolutions. Budden et al. [10] extended the 2D Winograd–based convolution to arbitrary dimensions and kernel sizes. A highly optimized implementation for modern many–core CPUs was presented by [18]. However, the community has not put similar optimization effort into FFT–based implementations.

This paper presents the results of comparing Winograd–based with FFT–based convolutions on modern CPUs. We have extended the highly optimized Winograd–based implementation for Intel Xeon Phi processors [18] to support arbitrary modern multi– and many–core CPUs (that support the AVX2 or the AVX512 instruction set). Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One is based on the regular FFT algorithm (Regular–FFT). The other is also FFT–based but uses Gauss' multiplication method (Gauss–FFT). All implementations are open–sourced at [8].

We have compared all three implementations with two popular ConvNets (VGG and AlexNet) on 10 different systems with modern CPUs. Our results show that, contrary to the popular belief, the Regular–FFT or Gauss–FFT implementations are generally faster than the Winograd implementation, and in some cases they are substantially faster.

To understand the experimental results, we have used the Roofline performance model [33] to analyze each algorithm and its implementation in detail by carefully analyzing its computational phases. This analysis provided two key explanations. First, the Winograd–based approach in most cases requires fewer floating point operations than the FFT–based approach because it works with real numbers instead of complex numbers. However, due to its numerical instability, the Winograd method can use only very small transform sizes [10, 21, 32]. The FFT–based convolutions do not suffer from such instabilities, allowing for arbitrarily large tile sizes. Large tile sizes allow the FFT–based approach to avoid a large number of redundant or unnecessary computations. Thus, in certain scenarios, the FFT–based method requires fewer operations than the Winograd–based one.

Second, our analysis considers not only the number of floating point operations, but also the total amount of data


movements (to and from memory) and their costs: the arithmetic intensity (operations per moved byte), the processor's speed, as well as the memory bandwidth. We also analyze the effects of cache sizes, which determine the arithmetic intensity of both methods and indirectly affect the running times.

Large arithmetic intensity allows for better utilization of hardware systems whose compute–to–memory ratios are increasing over time, because of the imbalance in the evolution of processor speeds and memory bandwidths; the speeds of computation typically improve much faster than memory bandwidths [35].¹ This benefits the FFT–based method more, as its arithmetic intensity is generally higher due to computing with complex numbers.

¹ For instance, the 4.5 TFLOPS Intel Knights Landing processor [17] has a compute–to–memory ratio of 11, whereas the latest Skylake Xeon and Skylake-X families of processors have reached ratios in the range between 20 and almost 40.

Our analysis suggests that whether the Winograd–based or an FFT–based approach is faster depends on the specific convolutional layer and the particular system it is executed on. However, on average, the FFT–based approaches outperform the Winograd–based one on commonly used neural networks, with the margin increasing as the system's compute–to–memory ratio increases.

The findings in this paper challenge the current belief that Winograd–based convolutions are better in nearly all practical cases.

2 Background

2.1 Winograd– and FFT–Based Convolutions

Winograd–Based Convolution

As recently illustrated by Lavin et al. [21], "valid" convolution of discrete signals, the main operation used in modern convolutional neural networks, can be performed using Winograd's minimal filtering algorithm [34] via

    f ∗_valid g = A^T [(Gg) ⊙ (B^T f)]    (1)

Where f and g are 1D signals, ⊙ represents element–wise multiplication, and A^T, B^T and G are special matrices derived from Vandermonde matrices for Homogeneous Coordinate polynomials [32]. By convention, Winograd convolutions have the matrices A^T, B^T and G in real space. In "valid" convolution, the filter slides across every "valid" location in the filtered image, such that the filter is fully contained inside the image. When the size of the filter is |g| = r, f ∗_valid g will have a length of |f| − |g| + 1 = m, and we refer to the method above as Winograd convolution F(m, r).
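As a concrete illustration of Eq. 1 (our own sketch, not code from the benchmarked implementations), the following computes F(2, 3), i.e. m = 2 outputs of a "valid" convolution with an r = 3 filter, using the standard F(2, 3) matrices listed by Lavin et al. [21]:

    #include <cstdio>

    // F(2,3): m = 2 outputs, r = 3 filter taps, tile size t = m + r - 1 = 4.
    // B^T, G and A^T are the standard matrices from Lavin et al. [21].
    int main() {
        const double BT[4][4] = {{1, 0, -1, 0}, {0, 1, 1, 0}, {0, -1, 1, 0}, {0, 1, 0, -1}};
        const double G[4][3]  = {{1, 0, 0}, {0.5, 0.5, 0.5}, {0.5, -0.5, 0.5}, {0, 0, 1}};
        const double AT[2][4] = {{1, 1, 1, 0}, {0, 1, -1, -1}};

        const double f[4] = {1, 2, 3, 4};  // input tile (t = 4 samples)
        const double g[3] = {1, 0, -1};    // filter (r = 3 taps)

        double Bf[4] = {0}, Gg[4] = {0}, prod[4], out[2] = {0};
        for (int i = 0; i < 4; ++i)        // B^T f
            for (int j = 0; j < 4; ++j) Bf[i] += BT[i][j] * f[j];
        for (int i = 0; i < 4; ++i)        // G g
            for (int j = 0; j < 3; ++j) Gg[i] += G[i][j] * g[j];
        for (int i = 0; i < 4; ++i) prod[i] = Gg[i] * Bf[i];  // element-wise product
        for (int i = 0; i < 2; ++i)        // A^T [ ... ]
            for (int j = 0; j < 4; ++j) out[i] += AT[i][j] * prod[j];

        // Compare against the directly computed "valid" convolution (correlation,
        // as used by ConvNets): out[k] = sum_j f[k + j] * g[j].
        for (int k = 0; k < 2; ++k) {
            double direct = 0;
            for (int j = 0; j < 3; ++j) direct += f[k + j] * g[j];
            std::printf("winograd % .1f   direct % .1f\n", out[k], direct);
        }
    }

The six multiplications of the direct method are replaced by the four multiplications of the element–wise product; the savings grow with m and r.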

FFT–Based Convolution

The convolution theorem states that a convolution can be performed using Fourier transforms via

    f ∗_circ g = F⁻¹(F(f) · F(g))    (2)

Here F and F⁻¹ are the Fourier and inverse Fourier transforms. In the discrete case, f and g need to have the same number of elements, which can be accomplished by padding zeros to the shorter signal.

The discrete Fourier transform (DFT) results in a circular (also known as cyclic) convolution. The result of the "valid" convolution can be extracted from the last |f| − |g| + 1 = m elements of the circular convolution.

"Valid" convolution using discrete FFTs can also be regarded as a special case of Eq. 1, where the matrices A^T, B^T and G are in complex space and are derived from Vandermonde matrices with the polynomial points being the roots of unity [32]. B^T and G perform implicitly zero–padded (to size of m + r − 1 = |f|) DFT transforms of f and g, and A^T computes the last m elements of the inverse DFT transform. Using the FFT algorithm allows for efficient computation of the matrix–vector products with the matrices A^T, B^T and G. We refer to this special case as Regular–FFT F(m, r).
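To illustrate Eq. 2 and the extraction of the "valid" result (again our own sketch, using a naive O(n²) DFT in place of an actual FFT), the last m elements of the circular convolution of f with the zero–padded g equal the "valid" convolution:

    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    using cd = std::complex<double>;

    // Naive DFT (sign = -1) and inverse DFT (sign = +1); a real FFT would be used in practice.
    std::vector<cd> dft(const std::vector<cd>& in, int sign) {
        const double PI = std::acos(-1.0);
        const int n = (int)in.size();
        std::vector<cd> out(n);
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                out[k] += in[j] * std::polar(1.0, sign * 2.0 * PI * k * j / n);
        if (sign > 0) for (cd& v : out) v /= n;
        return out;
    }

    int main() {
        const std::vector<double> f = {1, 2, 3, 4, 5};  // |f| = 5
        const std::vector<double> g = {1, 0, -1};       // |g| = r = 3, so m = |f| - r + 1 = 3
        const int n = (int)f.size(), r = (int)g.size(), m = n - r + 1;

        std::vector<cd> F(n), G(n);                     // g is zero-padded to |f| elements
        for (int i = 0; i < n; ++i) F[i] = f[i];
        for (int i = 0; i < r; ++i) G[i] = g[i];

        auto Ff = dft(F, -1), Fg = dft(G, -1);
        for (int i = 0; i < n; ++i) Ff[i] *= Fg[i];     // point-wise product in Fourier space
        const auto circ = dft(Ff, +1);                  // circular convolution of f and g

        for (int k = 0; k < m; ++k) {                   // last m elements = "valid" convolution
            double direct = 0;                          // direct valid convolution (kernel flipped)
            for (int j = 0; j < r; ++j) direct += f[k + j] * g[r - 1 - j];
            std::printf("fft % .1f   direct % .1f\n", circ[n - m + k].real(), direct);
        }
    }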

Multi–Dimensional Convolution

Both Winograd and FFT convolutions can easily be extended to an arbitrary number of dimensions [10]. N–dimensional convolution is performed via

    f ∗_valid g = [(g ×_{n=1}^N G_n) ⊙ (f ×_{n=1}^N B_n)] ×_{n=1}^N A_n^T    (3)

Here the operation x ×_{n=1}^N Y_n is short for x ×_1 Y_1 ×_2 · · · ×_N Y_N, where ×_n represents tensor–matrix mode–n multiplication as defined in [10, 19]. For the 2D case, x ×_1 M = Mx and x ×_2 M = xM^T. The formula above reduces to

    f ∗_valid g = A^T [(GgG^T) ⊙ (Bf B^T)] A    (4)

which is consistent with Lavin et al. [21].

2.2 Winograd– and FFT–Based Convolution Layers

A convolutional layer transforms an input tuple of C images into an output tuple of C′ images. A batch of B inputs, yielding a batch of B outputs, is processed at a time via

    I′_bc′ = Σ_{c=1}^{C} I_bc ∗ W_c′c    (5)

Where C and C′ denote the number of input/output images (also called channels or feature–maps), and I_bc and I′_bc′ are the (arbitrary dimensional) input and output images of the b–th batch. In total, B · C · C′ convolutions are performed.

Using Winograd or Regular–FFT convolution, the output images are computed via

    I′_bc′ = Σ_{c=1}^{C} [(W_cc′ ×_{n=1}^N G_n) ⊙ (I_bc ×_{n=1}^N B_n)] ×_{n=1}^N A_n^T
           = [Σ_{c=1}^{C} (W_cc′ ×_{n=1}^N G_n) ⊙ (I_bc ×_{n=1}^N B_n)] ×_{n=1}^N A_n^T    (6)

Note that we can have different sizes for the matrices A_n, B_n and G_n for each dimension.


Winograd F(m_n, r_n) and Regular–FFT F(m_n, r_n) convolutions assume particular sizes of I (m_n + r_n − 1) and I′ (m_n) along the n–th dimension. For larger image sizes, the convolution is performed using the overlap–add method (OLA) [27]. With OLA, the input images are divided into tiles with sizes of m_n + r_n − 1 and an overlap of r_n − 1 along the n–th dimension. Considering tiles at the same location from all the input images, tiles of size m_n of the output images are computed using the formula above.

The main savings in computation in both the Winograd and FFT methods come from the fact that both the kernel transforms (W_cc′ ×_{n=1}^N G_n) and image (tile) transforms (I_bc ×_{n=1}^N B_n) can be precomputed and reused many times. The computation is dominated by computing the dot products, i.e. the accumulation of the element–wise products inside the square brackets in Eqn. 6. Computing all the dot products in Eqn. 6 is an equivalent problem to matrix multiplications, with real matrices in the case of Winograd and complex matrices in the case of Regular–FFT convolution.
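In other words, for every element location e of the t × t transformed tiles, Eqn. 6 accumulates over the C input channels, which is exactly a multiplication of a (BN × C) matrix of transformed image values with a (C × C′) matrix of transformed kernel values. A schematic loop nest (our own illustration of the idea, not the paper's optimized JIT code):

    // One (BN x C) * (C x C') matrix multiplication per element location e.
    // U[e][i][c]  : transformed image tile i (of BN tiles) at location e, input channel c
    // V[e][c][cp] : transformed kernel (c -> cp) at location e
    // X[e][i][cp] : pre-transformed output tile i, output channel cp, at location e
    void elementwise_stage(int E, int BN, int C, int Cp,
                           const float* U, const float* V, float* X) {
        for (int e = 0; e < E; ++e)          // E = t*t locations for Winograd (real values),
            for (int i = 0; i < BN; ++i)     // t*ceil((t+1)/2) complex locations for FFT
                for (int cp = 0; cp < Cp; ++cp) {
                    float acc = 0.f;
                    for (int c = 0; c < C; ++c)
                        acc += U[(e * BN + i) * C + c] * V[(e * C + c) * Cp + cp];
                    X[(e * BN + i) * Cp + cp] = acc;
                }
    }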

2.3 Gauss' Multiplication of Complex Numbers

In the Regular–FFT convolution the computation is dominated by complex matrix multiplications, where each complex number pair multiplication requires 4 real multiplications when computed directly, and 3 real multiplications when using Gauss' multiplication algorithm [21, 23].

Using Gauss' multiplication algorithm, the product of two complex numbers u_r + u_i·i and v_r + v_i·i is computed by first computing tmp1 = v_r(u_r + u_i), tmp2 = u_r(v_i − v_r) and tmp3 = u_i(v_r + v_i). The real part of the result is then equal to tmp1 − tmp3, and the imaginary part to tmp1 + tmp2. Similarly, an element–wise product of two complex tensors U ⊙ V (U = U_r + U_i·i and V = V_r + V_i·i) can be performed using three element–wise products of real–valued tensors.

For the Regular–FFT convolutional layer, element–wise products of complex tensors representing the image (tile) transforms and kernel transforms are performed, and each tensor is reused many times (Eqn. 6). After performing a transform of an image tile, which yields a complex tensor U = U_r + U_i·i, a real–valued tensor U_r + U_i can be computed and stored alongside U_r and U_i. Similarly, tensors V_i − V_r and V_r + V_i can be computed during kernel transforms and stored alongside V_r (V_i does not have to be stored). Each element–wise product of complex tensors can then be replaced with three independent element–wise products of real–valued tensors.

The resulting three real tensors are implicitly converted back to a single complex tensor during the computation of the inverse transform (×_{n=1}^N A_n^T).

Computing all the dot products in Eqn. 6 is then performed using three real–valued matrix multiplications instead of a single complex matrix multiplication, reducing the number of required operations by 25%. We refer to the FFT method using Gauss' multiplication as Gauss–FFT (G(m, r)).
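A minimal sketch of Gauss' trick for a single complex product (our own illustration); applied tensor–wise, the same three products give the three real–valued element–wise passes described above. Note that (u_r + u_i) can be precomputed with the image transform, and (v_i − v_r) and (v_r + v_i) with the kernel transform:

    #include <complex>
    #include <cstdio>

    // Product of u = ur + ui*i and v = vr + vi*i using 3 real multiplications instead of 4.
    std::complex<double> gauss_mul(double ur, double ui, double vr, double vi) {
        const double tmp1 = vr * (ur + ui);  // (ur + ui) stored alongside the image transform
        const double tmp2 = ur * (vi - vr);  // (vi - vr) stored alongside the kernel transform
        const double tmp3 = ui * (vr + vi);  // (vr + vi) stored alongside the kernel transform
        return {tmp1 - tmp3, tmp1 + tmp2};   // real = tmp1 - tmp3, imaginary = tmp1 + tmp2
    }

    int main() {
        const auto a = gauss_mul(1.5, -2.0, 0.5, 3.0);
        const auto b = std::complex<double>(1.5, -2.0) * std::complex<double>(0.5, 3.0);
        std::printf("gauss: %g%+gi   direct: %g%+gi\n", a.real(), a.imag(), b.real(), b.imag());
    }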

3 Implementations

Both the Winograd and FFT approaches perform computations in four distinct stages: input transform, kernel transform, element–wise computation and inverse transform. The first two stages convert the images/kernels from the spatial domain into the Winograd or FFT domain. The third stage is equivalent to matrix multiplications. The inverse transform stage converts the results back to the spatial domain.

We extended the publicly available Winograd implementation from [3, 18], which was highly optimized for many–core CPUs, to support arbitrary AVX2 and AVX512 multi–core CPUs. We used it as a base for our FFT implementations, and reused most of the code (90%) in order to leverage the same optimization methods.

Optimizations. In order to achieve high utilization of the hardware, both the Winograd and FFT methods use the identical optimizations proposed in [18], including software prefetching, memory and cache blocking, using aligned vector data access, and interleaving memory access with computation.

We adopted the data layout proposed in [17, 18, 38] for input images, kernels and output, where 16 images are interleaved in memory for easy vectorization. In [18], 16 was used due to the size of the AVX512 vector register; we keep it at 16 regardless of the vector register size, as 16 is the cache–line width (16 32–bit floats), to facilitate efficient utilization of the memory subsystem.
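For illustration, a plausible indexing scheme for such an interleaved layout (our own sketch; the helper name and the exact dimension ordering are assumptions, not the paper's code) keeps 16 images innermost, so that one cache line holds the same pixel of 16 consecutive images:

    #include <cstddef>

    // [C/16][H][W][16] layout: the innermost dimension interleaves 16 images (channels),
    // i.e. one 64-byte cache line holds the same pixel of 16 consecutive images (16 floats).
    constexpr std::size_t kInterleave = 16;

    // Offset of pixel (y, x) of channel c in a C x H x W tensor (C assumed divisible by 16).
    std::size_t offset(std::size_t c, std::size_t y, std::size_t x,
                       std::size_t H, std::size_t W) {
        return (((c / kInterleave) * H + y) * W + x) * kInterleave + (c % kInterleave);
    }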

For data hand–off between the four stages of the algorithm, streaming stores to main memory are used, since the data will not be used in the near future. This saves memory bandwidth and avoids cache pollution.

Overall, all three implementations achieved high utilization of the available hardware, as discussed in the following sections.

Transforms. To perform the transforms of the input images and kernels, as well as the output images, the implementation of [18] provides C++ codelets that perform transforms of 16 tiles at a time. The codelets are created by generating Winograd transforms using Wincnn [7], after which a computation graph is created and optimized, yielding codelets that utilize AVX512 instructions to transform 16 tiles at a time.

replaced by C++ codelets generated using ldquogenfftrdquo suppliedwith FFTW [13] ldquoGenfftrdquo wasmodified so that it can generatecodelets that perform implicitly zerondashpadded FFT transformsas well as computing only a subset of elements of the inversetransform Multidimensional (forward) transforms were im-plemented by combining codelets performing implicitly zerondashpadded realndashtondashcomplex transforms along one dimensionand ones performing complexndashtondashcomplex transforms alongother dimensions Backward transforms combined complexndashtondashcomplex transform codelets and complexndashtondashreal onesDifferent from existing FFTndashbased convolutions which limittransform size to small prime factors [37] or only numbers

3

Table 1 Machine configurations used for benchmarking The MBrepresents the theoretical peak memory bandwidth and CMR rep-resents the ratio between the available FLOPS and the memorybandwidth

CPU Cores GFLOPS AVX Cache MB(GBs) CMR

Xeon Phi 7210 64 4506 512 512 KB 4096 11i7-6950X 10 960 2 1 MB 683 1406i9-7900X 10 2122 512 1 MB 96 22Xeon Gold 6148 20 3072 512 1 MB 128 24E7-8890v3 18 1440 2 256 KB 512 2813Xeon Platinum 8124M 18 3456 512 1MB 1152 30i9-7900X 10 2122 512 1 MB 683 31Xeon Phi 7210 48 4506 512 512 KB 1024 33Xeon Phi 7210 64 4506 512 512 KB 1024 3911i9-7900X 10 2122 512 1 MB 512 4125

that are powers of two [9 24] our implementations supportarbitrary sizesElementndashWise Multiplications For the elementndashwisestage where matrixndashmatrix multiplications are performedthe implementation of [18] provides JIT routines for realndashmatrix multiplications optimized for AVX512 instruction setFollowing the same principles of [18] we implemented JITrealndashmatrix multiplication routines optimized for the AVX2instruction set as well as complexndashnumber multiplicationroutines for both AVX512 and AVX2 instruction sets whichare required for the RegularndashFFT methodParallelization Through Static Scheduling Each of thestages of our algorithm is parallelized using static schedulingoriginally proposed in [38] using the generalized implemen-tation provided by [18] To achieve optimal performanceeach core is assigned roughly the same amount of compu-tation The work is then executed using a single forkndashjoinroutine

4 Performance Comparisons

To compare the running times of the FFT and Winograd methods, we benchmarked our FFT–based implementations and the improved Winograd implementation of Jia et al. [18] using the two most popular ConvNets: AlexNet [20] and VGG [28], which are frequently used for benchmarking [4]. We benchmarked the time required for the forward propagation of each distinct layer of the two networks.

Additional publicly available implementations were included for reference: the Winograd implementations provided by LIBXSMM [6] and MKL-DNN [2], and the direct convolution provided by MKL-DNN [2]. To the best of our knowledge, no other implementation of FFT–based methods for CPUs beside our own is currently available.

Both the Winograd and FFT methods can work with arbitrary transform sizes. Generally, larger transform sizes decrease the number of required operations. However, the numerical inaccuracy of Winograd convolutions increases exponentially with the transform (tile) size [18, 21, 26]. In practice, there is an upper bound on the transform size for which Winograd produces satisfactory results. All major vendors, such as FALCON, MKL-DNN, LIBXSMM and cuDNN [1, 2, 6, 9], implement the Winograd algorithm with transform sizes less than or equal to 6 × 6.² For these tile sizes, the numerical error of the computation is comparable to that of an implementation that computes the convolution directly.

² With transform sizes of 6 × 6, the average numerical error of the Winograd method on the benchmarked layers was 7.03 · 10⁻⁶, which is similar to the error of direct convolution, 1.11 · 10⁻⁶. When the transform size is increased to 8 × 8, the error increases to 1.24 · 10⁻³, which was expected [18]. The numerical error of the FFT–based method was no larger than 2.88 · 10⁻⁷, regardless of the tile size.

We follow the same convention and consider Winograd convolutions only with tile sizes of at most 6 × 6. However, we allow the FFT methods to have an arbitrarily large tile size, as they don't suffer from such numerical instability.

Running Time Experiments. The benchmarks were performed on a total of 10 systems, shown in Tbl. 1. In Fig. 1 we show detailed results on one of the systems. Note that both LIBXSMM's and MKL-DNN's Winograd implementations support only kernel sizes of 3 × 3, and thus cannot use the Winograd method for the second layer of AlexNet.

The Winograd–based method outperformed in only 3 out of 12 layers, whereas an FFT–based method outperformed on 6 layers, and on 3 layers they had roughly the same performance. More importantly, in the cases when the FFT methods outperformed, they did so by a larger margin, and on the layers that require more computation time. This suggests that the FFT methods can allow for significant savings in the overall computation time when compared to Winograd. For instance, the time spent on all convolutional layers in AlexNet using the Winograd method would consume 58.79 milliseconds, whereas the Regular–FFT method requires only 31.96 milliseconds; that is a speedup of 1.84x.

Additionally, in Fig. 2 we show the normalized running time on all other AVX512 systems (neither LIBXSMM nor MKL-DNN supports the AVX2 instruction set). The results for each individual layer are scaled based on the slowest implementation, as we are only interested in the relative performances. Detailed results are available in the Appendix. In all cases, our two FFT–based implementations, as well as the modified Winograd implementation of [18], outperformed the other publicly available implementations. Thus, for the rest of the paper we will only focus on these three implementations.

FFT transform sizes. An important observation was that the optimal transform sizes for the FFT method were not always powers of two: they were 27 for VGG 1.2, 25 for VGG 2.1 and VGG 2.2, 21 for VGG 3.1 and VGG 3.2, 16 for VGG 4.1 and VGG 4.2, and 9 for VGG 5.1/5.2. For AlexNet, the optimal sizes were 31 for the second layer and 15 for all other layers. This is counter–intuitive, as the common belief is that optimal FFT transforms should use sizes that only contain factors that are small primes [13, 37], or transforms with sizes equal to powers of two, as suggested by the two GPU FFT–based implementations, i.e. fbfft [24, 31] and cuDNN [11].



[Figure 1. Convolution layers' running time (in milliseconds) on the Xeon Gold CPU, for each layer of AlexNet and VGG, comparing Regular–FFT, Gauss–FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct, and LIBXSMM Winograd.]

[Figure 2. Normalized running time of different convolution implementations (Regular–FFT, Gauss–FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd) on the AVX512 systems; systems with the same CPUs are distinguished by their CMR.]

Relative performances. While an FFT–based method outperformed the Winograd one more often than not, the relative performances varied among different layers and systems. In the rest of the paper we focus on the theoretical analysis of all the methods. We would like to understand why our findings are not aligned with the popular belief that the Winograd method should be strictly faster.

5 Performance Analysis

Our experimental observations suggested that some of the stages in both the Winograd– and FFT–based approaches have relatively low utilization of the system's available FLOPS. In most, but not all, cases these were the transform stages, which have a relatively small amount of compute but access a relatively large amount of data, suggesting that their running time is bound by the memory bandwidth and not by the computational capabilities of the CPU.

For this reason, we used the Roofline [33] performance model to analyze the Winograd– and FFT–based convolutions.

Roofline Performance Model is an ISA–oblivious, throughput–oriented performance model used to estimate the performance of an application running on processing units, and is often used to predict performances on CPUs, GPUs, TPUs (Tensor Processing Units), etc.

It is suitable for analyzing particular methods on systems where the memory bandwidth often becomes the constraining resource. It estimates the performance bound of an algorithm as a function of its arithmetic intensity (AI), which is defined as the ratio of the total floating–point operations (FPO) and the total data movement (DM) in bytes (AI = FPO/DM) between different levels of the memory hierarchy. For each level of the memory hierarchy, the performance ceiling (attainable FLOPS) is determined by

    Attainable FLOPS = min(Peak FLOPS, MB × AI)    (7)

Where Peak FLOPS is defined as the maximal number of floating point operations per second of a given system, and MB as the system's peak memory bandwidth. When plotted, the performance ceiling line resembles a roofline.

Here we are interested in the DM between the highest

level of on–chip core–exclusive cache (typically L2 for CPUs) and off–chip shared memory.³ In systems where the L2 is shared among a small number of cores (e.g. the Intel Xeon Phi series shares the L2 cache between two cores), the L2 is assumed to be equally divided for exclusive usage among the cores.

³ The term off–chip refers to the main memory of the system, typically large but with much lower throughput and higher latency than the on–chip caches. The HBW MCDRAM of the Knights Landing processor would be considered off–chip.

DM then accounts for all regular and streaming stores to main memory, regular loads from main memory, and pre–fetches from main memory to either L1 or L2.

Our analysis does not consider the presence of higher level shared caches, as on modern processors the sizes of such caches are very limited and cannot fit the working set of typical convolutional layers.

Additional performance ceilings for lower levels of cache are not necessary, since in both the Winograd–based and FFT–based convolutional algorithms the computation is either bounded by transfers to and from main memory or by the processor's peak FLOPS, as shown in the following sections.

As the performance ceiling is set by the algorithm's arithmetic intensity (Eqn. 7), its running time can be estimated by

    running time = FPO / Attainable FLOPS
                 = FPO / min(Peak FLOPS, MB × AI)
                 = (FPO / MB) / min(CMR, AI)
                 = FPO / Peak FLOPS   if CMR ≤ AI
                 = DM / MB            if CMR > AI    (8)

Here CMR is the system's compute–to–memory ratio, defined as the ratio of its Peak FLOPS and MB, the memory bandwidth. FPO represents the total number of floating point operations required, and DM the total amount of data movement in bytes, as defined above. The running time is compute bound when CMR ≤ AI, in which case it can be computed as FPO / Peak FLOPS; otherwise the running time is memory bound and can be computed as DM / MB.

Estimating running time. For the Winograd– and FFT–based convolutional layers, where the computation is composed of several sequential stages (S), each with a unique arithmetic intensity (AI), the running time is calculated by accumulating the running times of each stage s ∈ S:

    running time_Total = Σ_{s∈S} running time_s = Σ_{s∈S} (FPO_s / MB) / min(CMR, AI_s)    (9)
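A direct transcription of Eqns. 7–9 (a sketch we add for clarity; the per–stage FLOP and byte counts would come from Table 2 in Appendix A):

    #include <algorithm>
    #include <vector>

    struct Stage { double flops; double bytes; };  // FPO_s and DM_s of one stage

    // Eqns. 7-9: each stage runs at min(Peak FLOPS, MB * AI); the layer time is the
    // sum of the stage times.
    double estimated_time(const std::vector<Stage>& stages,
                          double peak_flops,       // FLOP/s
                          double mem_bandwidth) {  // bytes/s
        double total = 0.0;
        for (const Stage& s : stages) {
            const double ai = s.flops / s.bytes;                                 // arithmetic intensity
            const double attainable = std::min(peak_flops, mem_bandwidth * ai);  // Eqn. 7
            total += s.flops / attainable;                                       // Eqns. 8, 9
        }
        return total;
    }

The speedup defined in Eqn. 10 below is then simply the ratio of two such estimates.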

5.1 Relative Performances

We are interested in the relative performance between the Winograd and FFT methods. We define Speedup(A, B) as the ratio of the running times of an algorithm B and an algorithm A:

    Speedup(A, B) = running time_B,Total / running time_A,Total    (10)

A speedup greater than one indicates that algorithm A is faster, and smaller than one means that algorithm B is faster.

Eqn. 9 estimates the running time of an algorithm assuming perfect utilization of the hardware, which is rarely possible in practice. However, Eqn. 10 is still valid when the compared algorithms are equally optimized, meaning that they utilize the same percentage of the hardware capabilities.

In Eqn. 10, the value of AI will differ between the Winograd– and FFT–based approaches, and will also depend on the amount of available cache [18]. Therefore, when comparing the performance of two algorithms, the relative performance on a given system will depend on the CMR ratio and the amount of available cache, but not on the absolute values of the system's compute speed or memory throughput.

Detailed analyses for obtaining the values of AI, DM and FPO are presented in Appendix A.

Using Eqn. 10 and the values of AI, DM and FPO calculated in Appendix A (Tbl. 2), we estimate the speedup between the FFT–based methods and the Winograd–based one. For all three methods, the tile sizes are chosen in a way that minimizes the total running time (Eqn. 9).

5.2 The Accuracy of the Theoretical Estimates

In Fig. 3 we plot the estimated theoretical speedup of the two FFT–based methods over the Winograd–based one. The solid lines represent the theoretical estimates of the relative performances as a function of the system's CMR. The color of the line represents the amount of available L2 cache. The lines are drawn in a semi–translucent fashion, as they overlap in many places.

The empirical results from the benchmarks described in Section 4 are overlaid: each cross–hair represents the measurement of the relative performances on a single system. The x coordinate of the cross–hairs represents the system's compute–to–memory (CMR) ratio (see Tbl. 1), and the color represents the L2 cache size.

Our empirical results are consistent with our theoretical estimates. The overall relative root mean square error (rRMSE) was 0.079 for Regular–FFT vs Winograd, and 0.1 for Gauss–FFT vs Winograd. The fitness⁴ was 92.68% for Regular–FFT vs Winograd, and 90% for Gauss–FFT vs Winograd.

⁴ fitness = 100 / (1 + rRMSE)

5.3 Discussions

Hardware utilization. We have also measured the system utilization of each stage of the three methods. While the utilization varied across benchmarked layers and systems, on average, during the compute bound stages 75% of the theoretical peak FLOPS were attained; in the memory bound stages, slightly more than 85% of the theoretical memory bandwidth was achieved. This results in a somewhat lower effective CMR, which is consistent with the results shown in Fig. 3: the empirical results are slightly shifted to the left (towards lower values of CMR) when compared to the theoretical predictions, which assume equal utilization of both the FLOPS and the memory bandwidth.

Optimality of Tile Transforms. Both "wincnn" [7] and "genfft" [13] use heuristics to minimize the number of operations for performing the transforms, and might not be optimal. However, as their AIs are much smaller than the CMRs of modern systems, the computation is memory bound, and the running time of the transform stages will depend solely on the amount of data movement. The largest AI of the FFT transforms is 5.55, and for Winograd 2.38, much lower than the CMRs of modern CPUs. The Xeon Phi Knights Landing processor has a CMR of 11 (due to on–chip fast MCDRAM), and the Xeon Server processor family has CMRs in the range of 20–40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were provided.


[Figure 3. Theoretical estimates and empirical measurements for the speedup of the Regular–FFT and Gauss–FFT methods over the Winograd method on VGG and AlexNet. Each panel plots the speedup against the system's compute–to–memory ratio (FLOPs/Byte); solid lines are theoretical estimates, cross–hairs are empirical measurements, and color indicates the L2 cache size (256 KB, 512 KB or 1024 KB).]

This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. Using such sizes, the images could be divided into overlapping tiles with minimal overhead (minimal padding of the original image), which reduces both the number of required operations and the amount of required data movement.

6 Conclusions

This paper presents experimental and analytical findings that FFT–based convolutions are, on average, faster than Winograd–based convolutions on modern CPUs.

In contrast to the popular belief that the Winograd–based method is more efficient, our experimental results of a highly optimized Winograd–based implementation and two similarly optimized FFT–based implementations on modern CPUs show that the FFT convolutions are, more often than not, faster than Winograd ones for the most commonly used ConvNet layers.

Our analysis using the Roofline model shows that whether the Winograd– or FFT–based approach is faster depends on both the layer and the target hardware. However, with the tendency of increasing compute–to–memory ratios of systems with modern CPUs, the FFT–based methods tend to be faster than Winograd.

While we primarily focused on modern multi– and many–core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can be applied to GPUs. Future work might include implementing and benchmarking efficient GPU–based implementations, and validating the performance analysis based on the Roofline model.

References

[1] 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library (2016).
[2] 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn (2016).
[3] 2018. N-Dimensional Winograd–based convolution framework. https://bitbucket.org/poozh/ond-winograd (2018).
[4] Accessed 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks.
[5] Accessed 08-15-2018. Intel Nervana reference deep learning framework. https://github.com/NervanaSystems/neon.
[6] Accessed 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm.
[7] Accessed 08-15-2018. Wincnn. https://github.com/andravin/wincnn.
[8] Accessed 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master.
[9] Jeremy Appleyard. Accessed 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5.
[10] David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009). IEEE, 248-255.
[13] Matteo Frigo and Steven G. Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 3. IEEE, 1381-1384.
[14] Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 229-254. http://dl.acm.org/citation.cfm?id=647970.761024
[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis (SC16). IEEE, 981-991.
[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[19] Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455-500.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013-4021.
[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
[23] M. Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306-308.
[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).
[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2924-2932.
[26] Victor Y. Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676-694.
[27] Lawrence R. Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall.
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1-9.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24-26.
[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. SIAM.
[35] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20-24.
[36] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium, 2016 IEEE International. IEEE, 801-811.
[37] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis (SC16). IEEE, 854-865.
[38] Aleksandar Zlateski and H. Sebastian Seung. 2017. Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity, we assume 2D and isotropic images and kernels, as well as computation being performed with 32–bit floating point numbers (4 bytes per number). Extension to non–isotropic and N–dimensional images and kernels, as well as lower precision floats, is straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non–isotropic kernels.

We follow the same notation as in Section 2, and for an arbitrary layer use Winograd convolution F(m², r²), Regular–FFT convolution F(m², r²), and Gauss–FFT convolution G(m², r²). Here m can take an arbitrary value, whereas r² has to equal the size of the kernels in the layer.

We proceed to present the details of all three methods and estimate the total number of operations and data movements required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A.1 Input Transform Stage

In Winograd and both FFT approaches, each of the B · C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has a size of t², with t = (m + r − 1), and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions. Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image will be implicitly zero–padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.
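For example, the number of tiles per image follows directly from the tile size and overlap (a small helper we add for illustration):

    #include <cstdio>

    // Number of t x t tiles (t = m + r - 1), overlapping by r - 1 pixels, covering an
    // x by x image: N = ceil((x - r + 1) / m)^2.
    long tiles_per_image(long x, long m, long r) {
        const long per_dim = (x - r + 1 + m - 1) / m;  // ceil((x - r + 1) / m)
        return per_dim * per_dim;
    }

    int main() {
        // E.g. a 224 x 224 image with 3 x 3 kernels and m = 6: ceil(222 / 6)^2 = 37^2 = 1369 tiles.
        std::printf("%ld\n", tiles_per_image(224, 6, 3));
    }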

Number of required operations. Each tile (I_bcn) is transformed via Ĩ_bcn = B I_bcn B^T, which is a composition of two matrix–matrix multiplications. Here I_bcn is the n–th tile of the c–th channel in batch b, and Ĩ_bcn is its transform. Both B and I_bcn have the size t², requiring a total of 4t³ operations (2t³ per matrix–matrix multiplication). The operations are on real numbers for the Winograd approach and on complex numbers for the FFT approaches.

The transforms can, however, be performed with far fewer operations. In the Winograd method, the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real–to–complex DFT transforms are performed, which require far fewer operations (μn² log n instead of O(n³)). Here the constant μ can vary significantly based on the size of the transform performed [13]. It takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.

Instead of using a closed–form expression that gives bounds on the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd, we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods, we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as F_I(m², r²) for Winograd, F_I(m², r²) for Regular–FFT, and G_I(m², r²) for Gauss–FFT. The pre–computed lookup tables are included in our open–sourced repository.

The total numbers of operations required for performing all the transforms for all three methods are given in Tbl. 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in–cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx² · 4 bytes: reading each of the B · C images of size x² only once. Each of the BCN transformed tiles is stored back to main memory. The size of each transformed tile will depend on the method.

In Winograd convolution, transformed tiles are real tensors and require a single 32–bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms will be complex conjugate–symmetric (along one of the dimensions). Only t⌈(t + 1)/2⌉ complex numbers need to be stored. The Regular–FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and the Gauss–FFT requires three reals per complex number, for a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.

A.2 Kernel Transform Stage

In all methods, each of the C × C′ kernels (with size r × r) is transformed via W̃_cc′ = G W_cc′ G^T. The matrix G has size t × r.

Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero–padded to match the size of the input tiles. In practice, the transforms are computed with implicit zero–padding. As in the input stage, we have pre–computed the number of operations required for transforming a kernel in all three methods: F_K(m², r²) for Winograd and F_K(m², r²) for Regular–FFT. For the Gauss–FFT method, G_K(m², r²) = F_K(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.


Table 2. FLOPS, DM and AI for different stages of the Winograd (F(m², r²)), Regular–FFT (F(m², r²)) and Gauss–FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

FLOPS
    Input image transform:  Winograd BCN F_I(m², r²);  Regular–FFT BCN F_I(m², r²);  Gauss–FFT BCN G_I(m², r²)
    Kernel transform:       Winograd CC′ F_K(m², r²);  Regular–FFT CC′ F_K(m², r²);  Gauss–FFT CC′ G_K(m², r²)
    Element–wise product:   Winograd 2t²BNCC′;  Regular–FFT 8t⌈(t+1)/2⌉BNCC′;  Gauss–FFT 6t⌈(t+1)/2⌉BNCC′
    Output transform:       Winograd BC′N F_O(m², r²);  Regular–FFT BC′N F_O(m², r²);  Gauss–FFT BC′N G_O(m², r²)

DM
    Input image transform:  Winograd 4BCx² + 4BCNt²;  Regular–FFT 4BCx² + 8BCNt⌈(t+1)/2⌉;  Gauss–FFT 4BCx² + 12BCNt⌈(t+1)/2⌉
    Kernel transform:       Winograd 4CC′(r² + t²);  Regular–FFT 4CC′(r² + 2t⌈(t+1)/2⌉);  Gauss–FFT 4CC′(r² + 3t⌈(t+1)/2⌉)
    Element–wise product:   Winograd 4t²BN(c + αc′)CC′/(cc′);  Regular–FFT 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′);  Gauss–FFT 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
    Output transform:       Winograd 4BC′N(t² + m²);  Regular–FFT 4BC′N(2t⌈(t+1)/2⌉ + m²);  Gauss–FFT 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
    Input image transform:  Winograd N F_I(m², r²) / (4x² + 4Nt²);  Regular–FFT N F_I(m², r²) / (4x² + 8Nt⌈(t+1)/2⌉);  Gauss–FFT N G_I(m², r²) / (4x² + 12Nt⌈(t+1)/2⌉)
    Kernel transform:       Winograd F_K(m², r²) / (4r² + 4t²);  Regular–FFT F_K(m², r²) / (4r² + 8t⌈(t+1)/2⌉);  Gauss–FFT G_K(m², r²) / (4r² + 12t⌈(t+1)/2⌉)
    Element–wise product:   Winograd cc′ / (2(c + αc′));  Regular–FFT cc′ / (c + αc′);  Gauss–FFT cc′ / (2(c + αc′))
    Output transform:       Winograd F_O(m², r²) / (4t² + 4m²);  Regular–FFT F_O(m², r²) / (8t⌈(t+1)/2⌉ + 4m²);  Gauss–FFT G_O(m², r²) / (12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory, reading a total of 4CC′r² bytes, and store the CC′ transformed kernels back to main memory. The transformed kernels have size t × t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉ and 12t⌈(t + 1)/2⌉ bytes for the Winograd, Regular–FFT and Gauss–FFT methods, respectively.

The total number of operations, data movements and AIs for transforming all kernels of a particular layer using any of the three methods are given in Tbl. 2.

A.3 Element–wise Products

Having all transformed input and kernel tiles, the pre–transformed output tiles Ĩ′_bc′n are computed via

    Ĩ′_bc′n = Σ_{c=1}^{C} Ĩ_bcn ⊙ W̃_cc′    (11)

Here, all of the pre–transformed output tiles (Ĩ′_bc′n), transformed input tiles Ĩ_bcn and transformed kernels W̃_cc′ have the same size, t × t. Computing an element of the pre–transformed output tile at a location e depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

    Ĩ′_bc′n(e) = Σ_{c=1}^{C} Ĩ_bcn(e) · W̃_cc′(e)    (12)

Note that the equation above can be interpreted as the multiplication of a matrix U_e of size BN × C with a matrix V_e of size C × C′, resulting in a matrix X_e of size BN × C′. For layers of modern ConvNets, the values of BN are much larger than C, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location e of Ĩ′_bc′n.

A.3.1 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT–based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre–transformed output tiles need to be computed. As both the transformed input tiles and transformed kernels are conjugate–symmetric, the pre–transformed output tiles will also be conjugate–symmetric.

In the Regular–FFT method, 4 real multiply–adds are required for performing a complex multiply–add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer. Gauss–FFT reduces the complex matrix multiplications to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs for the whole layer.

A.3.2 Data Movement

In the Winograd method, t² real matrix multiplications X = U × V are performed (for the rest of the section we will omit the subscript e for clarity). In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, with the same sizes. The Gauss–FFT method replaces each complex matrix multiplication with three real matrix multiplications; thus a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U and V may not fit in cache, in which case standard cache–blocking approaches are employed [14-16, 18, 22], which might require some data to be moved to and/or from the main memory multiple times.

In the standard cache-blocking technique, V (of size C × C′) is subdivided into equally sized sub-matrices of size c × c′. To minimize transfers to and from main memory, a sub-matrix of V is kept in cache. A small sub-matrix of U, of size ρ × c, is fetched from memory, multiplied with the sub-matrix of V stored in-cache, and accumulated into the appropriate sub-matrix of X. Here ρ is a small number required for efficient in-cache computation [14-16, 18, 22].

This requires transferring a total of ρc numbers from main memory, and ρc′ numbers of X from, and then back to, main memory; a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub-matrix multiplication produces the final result (no accumulation is necessary).

A total of BNCC′/(ρcc′) such sub-matrix multiplications are performed, requiring BNCC′(c + αc′)/(cc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications

(X = U × V) are performed, thus requiring a total of 4t²BN(c + αc′)CC′/(cc′) bytes to be moved.

In the Regular-FFT method, t⌈(t+1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

The Gauss-FFT method replaces each of the t⌈(t+1)/2⌉ complex matrix multiplications with 3 real matrix multiplications; a total of 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′) bytes need to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values of c and c′ while allowing for in-cache computation.

As the values of t, B, C, C′ and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

minimize_{c,c′}   (c + αc′) / (cc′)
subject to        c | C     (C is divisible by c)
                  c′ | C′   (C′ is divisible by c′)
                  4βcc′ ≤ (Cache Size)/2   (fits in half the cache)      (13)

where α equals 1 when c = C and 2 when c < C, and β is 1 for real-valued matrices and 2 for complex-valued ones. Half the cache is allowed for the sub-matrices of V; this is typical practice, to ensure enough space for the sub-matrices of U and X.

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for the optimal values of c and c′. This results in AIs of cc′/2(c + αc′) for the Winograd and Gauss-FFT methods, and cc′/(c + αc′) for the Regular-FFT method, which are equal to the AIs of real-matrix and complex-matrix multiplications, respectively [14-16, 18].
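In practice, Eq. 13 can be solved by enumerating the divisors of C and C′. The sketch below (a hypothetical helper written purely to illustrate the model, not part of the paper's implementation) returns the blocking that minimizes (c + αc′)/(cc′) and the corresponding arithmetic intensity:

```python
def best_blocking(C, Cp, cache_bytes, complex_matrices=False):
    """Solve Eq. 13 by enumeration; returns (c, c', AI) for the element-wise stage.
    beta = 2 for complex-valued matrices (Regular-FFT), 1 for real ones."""
    beta = 2 if complex_matrices else 1

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    best = None
    for c in divisors(C):
        for cp in divisors(Cp):
            if 4 * beta * c * cp > cache_bytes / 2:   # sub-matrix of V must fit in half the cache
                continue
            alpha = 1 if c == C else 2                # no accumulation needed when c == C
            cost = (c + alpha * cp) / (c * cp)        # objective of Eq. 13
            if best is None or cost < best[0]:
                # AI = cc'/(2(c + a c')) for real matrices, cc'/(c + a c') for complex ones
                ai = (c * cp) / ((2 / beta) * (c + alpha * cp))
                best = (cost, c, cp, ai)
    return best[1:] if best else None

# Example: C = C' = 512 channels, 1 MB of L2 cache, complex matrices (Regular-FFT).
print(best_blocking(512, 512, 1024 * 1024, complex_matrices=True))
```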

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels, as a function of cache size.

Figure 4. Arithmetic intensities for the element-wise stage given various amounts of cache. In the Winograd and Gauss-FFT methods real matrix multiplications are performed, whereas in the Regular-FFT method complex matrix multiplications are performed. [Two panels (real and complex matrices) plotting arithmetic intensity against cache size (0-1000 KB), with curves for C = C′ = 64, 128, 256, 512 and 1024 channels.]

The AIs of both complex (used in the Regular-FFT method) and real matrix multiplication (used in Winograd and Gauss-FFT) increase with the cache size and the number of channels (C and C′). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element-wise stage of the Regular-FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss-FFT convolutions when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the BNC′ pre-transformed output tiles is transformed from the Winograd/FFT domain back to the spatial domain via I′_{b,c′,n} = A^T Ĩ′_{b,c′,n} A, where Ĩ′_{b,c′,n} denotes the pre-transformed tile. The matrix A has size t × m (t = m + r − 1), resulting in I′_{b,c′,n} having a size of m × m. Here I′_{b,c′,n} is the n-th (non-overlapping) tile of the c′-th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m × m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements and AIs for transforming the output tiles of a particular layer using any of the three methods are given in Tbl. 2.

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular-FFT method.

In Tbl. 7 and Tbl. 8 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss-FFT method.

Table 3. FLOPs required for performing Winograd transforms of a single tile or kernel.

Method      r=2: In Ker Out   r=3: In Ker Out    r=4: In Ker Out    r=5: In Ker Out    r=6: In Ker Out    r=7: In Ker Out
F(2²,r²)    12 5 5            32 49 42           180 198 154        240 231 168        742 728 504        704 645 430
F(3²,r²)    32 24 28          180 128 128        240 170 153        742 552 460        704 518 407        - - -
F(4²,r²)    180 70 90         240 135 150        742 396 396        704 403 372        - - -              - - -
F(5²,r²)    240 72 99         742 260 312        704 300 325        - - -              - - -              - - -
F(6²,r²)    742 144 208       704 242 308        - - -              - - -              - - -              - - -
F(7²,r²)    704 130 195       - - -              - - -              - - -              - - -              - - -

Table 4. AIs of Winograd transforms for a single tile or kernel.

Method      r=2: In Ker Out     r=3: In Ker Out     r=4: In Ker Out     r=5: In Ker Out     r=6: In Ker Out     r=7: In Ker Out
F(2²,r²)    0.17 0.10 0.10      0.25 0.49 0.53      0.90 1.21 1.33      0.83 0.95 1.05      1.89 2.14 2.38      1.38 1.43 1.58
F(3²,r²)    0.25 0.30 0.28      0.90 0.94 0.94      0.83 0.82 0.85      1.89 1.86 1.98      1.38 1.29 1.39      - - -
F(4²,r²)    0.90 0.60 0.55      0.83 0.75 0.72      1.89 1.52 1.52      1.38 1.13 1.16      - - -               - - -
F(5²,r²)    0.83 0.45 0.41      1.89 1.12 1.05      1.38 0.94 0.91      - - -               - - -               - - -
F(6²,r²)    1.89 0.68 0.61      1.38 0.83 0.77      - - -               - - -               - - -               - - -
F(7²,r²)    1.38 0.48 0.43      - - -               - - -               - - -               - - -               - - -

Table 5. FLOPs required for performing Regular-FFT transforms of a single tile or kernel.

Method      r=2: In Ker Out   r=3: In Ker Out   r=4: In Ker Out   r=5: In Ker Out   r=6: In Ker Out   r=7: In Ker Out

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -


Table 6. AIs of Regular-FFT transforms for a single tile or kernel.

Method      r=2: In Ker Out   r=3: In Ker Out   r=4: In Ker Out   r=5: In Ker Out   r=6: In Ker Out   r=7: In Ker Out

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -


C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss-FFT vs. the Regular-FFT method.

D Comparing to State-of-the-art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and other state-of-the-art libraries. The benchmarks were performed on AVX512-capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.


Table 7. FLOPs required for performing Gauss-FFT transforms of a single tile or kernel.

Method      r=2: In Ker Out   r=3: In Ker Out   r=4: In Ker Out   r=5: In Ker Out   r=6: In Ker Out   r=7: In Ker Out

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

Figure 5. Theoretical estimates of our model and empirical measurements for the speedup of the Regular- vs. the Gauss-FFT method, on VGG and AlexNet. [One panel per layer (VGG 1.2-5.1/5.2, AlexNet 2-5); x-axis: compute-to-memory ratio (0-60); y-axis: speedup (0.50-1.50).]


Table 8. AIs of Gauss-FFT transforms for a single tile or kernel.

Method      r=2: In Ker Out   r=3: In Ker Out   r=4: In Ker Out   r=5: In Ker Out   r=6: In Ker Out   r=7: In Ker Out

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -


Figure 6. Running time of our implementations compared to state-of-the-art on AVX512-capable systems. [Bar charts of per-layer running times in milliseconds for the VGG and AlexNet layers, comparing Our Winograd, Our Regular-FFT, Our Gauss-FFT, MKL-DNN Winograd, MKL-DNN Direct and LIBXSMM Winograd, on several AVX512 systems including a Xeon Phi 7210 (409.6 GB/s, CMR of 11) and i9-7900X configurations (96 GB/s, CMR of 22; 68.3 GB/s, CMR of 31); "n/a" where an implementation does not support the layer.]


Figure 7. Running time of our implementations compared to state-of-the-art on AVX512-capable systems, continued. [Bar charts of per-layer running times in milliseconds for the VGG and AlexNet layers, comparing the same implementations on a Xeon Phi 7210 using 48 cores (102.4 GB/s, CMR of 33), a Xeon Phi 7210 (102.4 GB/s, CMR of 39.11) and an i9-7900X (51.2 GB/s, CMR of 41.25).]



movements (to and from memory) and their costs, the arithmetic intensity (operations per moved byte), the processor's speed, as well as the memory bandwidth. We also analyze the effects of cache sizes, which determine the arithmetic intensity of both methods and indirectly affect the running times.

Large arithmetic intensity allows for better utilization of hardware systems, whose compute-to-memory ratios are increasing over time because of the imbalance between the evolution of processor speeds and memory bandwidths; the speeds for computations typically improve much faster than memory bandwidths [35].1 This benefits the FFT-based methods more, as their arithmetic intensity is generally higher due to computing with complex numbers.

1 For instance, the 4.5 TFLOPS Intel Knights Landing processor [17] has a compute-to-memory ratio of 11, whereas the latest Skylake Xeon and Skylake-X families of processors have reached ratios in the range between 20 and almost 40.

Our analysis suggests that whether the Winograd-based or an FFT-based approach is faster depends on the specific convolutional layer and the particular system it is executed on. However, on average, the FFT-based approaches outperform the Winograd-based one on commonly used neural networks, with the margin increasing as the system's compute-to-memory ratio increases.

The findings in this paper challenge the current belief that Winograd-based convolutions are better in nearly all practical cases.

2 Background

2.1 Winograd- and FFT-Based Convolutions

Winograd-Based Convolution

As recently illustrated by Lavin et al. [21], "valid" convolution of discrete signals, the main operation used in modern convolutional neural networks, can be performed using Winograd's minimal filtering algorithm [34] via

f ∗_valid g = A^T [(Gg) ⊙ (B^T f)]      (1)

where f and g are 1D signals, ⊙ represents element-wise multiplication, and A^T, B^T and G are special matrices derived from Vandermonde matrices for Homogeneous Coordinate polynomials [32]. By convention, Winograd convolutions have the matrices A^T, B^T and G in real space. In "valid" convolution the filter slides across every "valid" location in the filtered image, such that the filter is fully contained inside the image. When the size of the filter is |g| = r, f ∗_valid g will have a length of |f| − |g| + 1 = m, and we refer to the method above as the Winograd convolution F(m, r).
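As a small concrete example (a sketch, not the paper's implementation), the following snippet evaluates Eq. 1 for F(2, 3), i.e. m = 2 outputs with an r = 3 filter, using the standard F(2, 3) matrices from [21]; the result matches the two sliding dot products that "valid" convolution as used in ConvNets produces:

```python
import numpy as np

# Standard F(2, 3) transform matrices (Lavin & Gray [21]); tile size m + r - 1 = 4.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

f = np.random.rand(4)   # input tile (m + r - 1 elements)
g = np.random.rand(3)   # filter

y = AT @ ((G @ g) * (BT @ f))              # Eq. 1: A^T [(G g) . (B^T f)]

# Reference: the two sliding dot products of the filter over the tile.
ref = np.array([f[0:3] @ g, f[1:4] @ g])
assert np.allclose(y, ref)
```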

FFT-Based Convolution

The convolution theorem states that a convolution can be performed using Fourier transforms via

f ∗_circ g = F⁻¹(F(f) · F(g))      (2)


Here F and F⁻¹ are the Fourier and inverse Fourier transforms. In the discrete case, f and g need to have the same number of elements, which can be accomplished by padding zeros to the shorter signal.

The discrete Fourier transform (DFT) results in a circular (also known as cyclic) convolution. The result of the "valid" convolution can be extracted from the last |f| − |g| + 1 = m elements of the circular convolution.

"Valid" convolution using discrete FFTs can also be regarded as a special case of Eq. 1, where the matrices A^T, B^T and G are in complex space and are derived from Vandermonde matrices with the polynomial points being the roots of unity [32]. B^T and G perform implicitly zero-padded (to size m + r − 1 = |f|) DFT transforms of f and g, and A^T computes the last m elements of the inverse DFT transform. Using the FFT algorithm allows for efficient computation of matrix-vector products with the matrices A^T, B^T and G. We refer to this special case as Regular-FFT F(m, r).
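The sketch below (illustrative only) makes this relation concrete with NumPy FFTs: the kernel is implicitly zero-padded to |f|, and the last m elements of the circular convolution equal the valid convolution (for the flipped-kernel correlation used in ConvNets, one would reverse the kernel first):

```python
import numpy as np

f = np.random.rand(16)                 # 1D image
g = np.random.rand(3)                  # kernel, |g| = r
m = f.size - g.size + 1                # number of "valid" outputs

# Circular convolution via the convolution theorem (Eq. 2),
# with g implicitly zero-padded to |f| elements.
circ = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g, f.size)))

valid = circ[-m:]                      # last m elements = valid convolution
assert np.allclose(valid, np.convolve(f, g, mode='valid'))
```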

Multi-Dimensional Convolution

Both Winograd and FFT convolutions can easily be extended to an arbitrary number of dimensions [10]. N-dimensional convolution is performed via

f ∗_valid g = [(g ×_{n=1}^N G_n) ⊙ (f ×_{n=1}^N B_n)] ×_{n=1}^N A_n^T      (3)

Here the operation x ×_{n=1}^N Y_n is short for x ×_1 Y_1 ×_2 · · · ×_N Y_N, where ×_n represents tensor-matrix mode-n multiplication as defined in [10, 19]. For the 2D case, x ×_1 M = Mx and x ×_2 M = xM^T. The formula above reduces to

f ∗_valid g = A^T [(GgG^T) ⊙ (Bf B^T)] A      (4)

which is consistent with Lavin et al. [21].

2.2 Winograd- and FFT-Based Convolution Layers

A convolutional layer transforms an input tuple of C images into an output tuple of C′ images. A batch of B inputs, yielding a batch of B outputs, is processed at a time via

I′_{b,c′} = Σ_{c=1}^{C} I_{b,c} ∗ W_{c′,c}      (5)

where C and C′ denote the number of input/output images (also called channels or feature-maps), and I_{b,c} and I′_{b,c′} are the (arbitrary-dimensional) input and output images of the b-th batch. In total, B · C · C′ convolutions are performed.

Using Winograd or Regular-FFT convolution, the output images are computed via

I′_{b,c′} = Σ_{c=1}^{C} [(W_{c,c′} ×_{n=1}^N G_n) ⊙ (I_{b,c} ×_{n=1}^N B_n)] ×_{n=1}^N A_n^T
         = [Σ_{c=1}^{C} (W_{c,c′} ×_{n=1}^N G_n) ⊙ (I_{b,c} ×_{n=1}^N B_n)] ×_{n=1}^N A_n^T      (6)

Note that we can have different sizes for the matrices A_n, B_n and G_n for each dimension.


F(m_n, r_n) and G(m_n, r_n) assume a particular size of I (m_n + r_n − 1) and I′ (m_n) along the n-th dimension. For larger image sizes, the convolution is performed using the overlap-add method (OLA) [27]. With OLA, the input images are divided into tiles with sizes of m_n + r_n − 1 and an overlap of r_n − 1 along the n-th dimension. Considering tiles at the same location from all the input images, tiles of size m_n of the output images are computed using the formula above.
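For illustration, a minimal sketch of the OLA tiling along one dimension (hypothetical sizes; padding of the last, partial tile is omitted for brevity):

```python
import numpy as np

def ola_tiles(x, m, r):
    """Split a 1D signal into overlapping tiles of size m + r - 1 with an
    overlap of r - 1, as the overlap-add (OLA) method requires. A sketch;
    boundary padding of the final partial tile is not handled."""
    t = m + r - 1
    n_tiles = (x.size - (r - 1)) // m            # number of full tiles
    return np.stack([x[i * m : i * m + t] for i in range(n_tiles)])

x = np.arange(20.0)
tiles = ola_tiles(x, m=4, r=3)                   # each tile has 6 elements, stride 4
# Each tile yields m = 4 outputs of the valid convolution; concatenating the
# per-tile outputs reproduces the valid convolution of the whole signal.
```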

The main savings in computation in both the Winograd and FFT methods come from the fact that both the kernel transforms (W_{c,c′} ×_{n=1}^N G_n) and the image (tile) transforms (I_{b,c} ×_{n=1}^N B_n) can be precomputed and reused many times. The computation is dominated by computing the dot products, i.e. the accumulation of element-wise products inside the square brackets in Eqn. 6. Computing all the dot products in Eqn. 6 is an equivalent problem to matrix multiplications, with real matrices for the case of Winograd and complex matrices for the case of Regular-FFT convolution.

2.3 Gauss' Multiplication of Complex Numbers

In the Regular-FFT convolution, the computation is dominated by complex matrix multiplications, where each complex-number pair multiplication requires 4 real multiplications when computed directly and 3 real multiplications when using Gauss' multiplication algorithm [21, 23].

Using Gauss' multiplication algorithm, the product of two complex numbers u_r + u_i·i and v_r + v_i·i is computed by first computing tmp1 = v_r(u_r + u_i), tmp2 = u_r(v_i − v_r) and tmp3 = u_i(v_r + v_i). The real part of the result is then equal to tmp1 − tmp3 and the imaginary part to tmp1 + tmp2. Similarly, an element-wise product of two complex tensors U ⊙ V (U = U_r + U_i·i and V = V_r + V_i·i) can be performed using three element-wise products of real-valued tensors.

For the Regular-FFT convolutional layer, element-wise products of complex tensors representing the image (tile) transforms and kernel transforms are performed, and each tensor is reused many times (Eqn. 6). After performing a transform of an image tile, which yields a complex tensor U = U_r + U_i·i, a real-valued tensor U_r + U_i can be computed and stored alongside U_r and U_i. Similarly, the tensors V_i − V_r and V_r + V_i can be computed during the kernel transforms and stored alongside V_r (V_i does not have to be stored). Each element-wise product of complex tensors can then be replaced with three independent element-wise products of real-valued tensors.

The resulting three real tensors are implicitly converted back to a single complex tensor during the computation of the inverse transform (×_{n=1}^N A_n^T).

Computing all the dot products in Eqn. 6 is then performed using three real-valued matrix multiplications instead of a single complex matrix multiplication, reducing the number of required operations by 25%. We refer to the FFT method using Gauss' multiplication as Gauss-FFT (G(m, r)).
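The following sketch (illustrative, not the paper's optimized kernels) applies Gauss' trick to element-wise products of complex tensors, using the three real-valued combinations that, per the description above, would be precomputed during the transforms:

```python
import numpy as np

def gauss_elementwise_product(ur, ui, vr, vi):
    """Element-wise complex product (ur + ui*i) * (vr + vi*i) using three real
    multiplications per element (Gauss' algorithm). In the Gauss-FFT layer,
    (ur + ui), (vi - vr) and (vr + vi) would be precomputed and stored during
    the input/kernel transforms."""
    tmp1 = vr * (ur + ui)
    tmp2 = ur * (vi - vr)
    tmp3 = ui * (vr + vi)
    return tmp1 - tmp3, tmp1 + tmp2      # real part, imaginary part

u = np.random.rand(4, 4) + 1j * np.random.rand(4, 4)
v = np.random.rand(4, 4) + 1j * np.random.rand(4, 4)
re, im = gauss_elementwise_product(u.real, u.imag, v.real, v.imag)
assert np.allclose(re + 1j * im, u * v)
```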

3 Implementations

Both the Winograd and FFT approaches perform computations in four distinct stages: input transform, kernel transform, element-wise computation, and inverse transform. The first two stages convert the images/kernels from the spatial domain into the Winograd or FFT domain. The third stage is equivalent to matrix multiplications. The inverse transform stage converts the results back to the spatial domain.

We extended the publicly available Winograd implementation from [3, 18], which was highly optimized for many-core CPUs, to support arbitrary AVX2 and AVX512 multi-core CPUs. We used it as a base for our FFT implementations. We reused most of the code (90%) in order to leverage the same optimization methods.

Optimizations. In order to achieve high utilization of the hardware, both the Winograd and FFT methods use the identical optimizations proposed in [18], including software prefetching, memory and cache blocking, aligned vector data access, and interleaving memory access with computation.

We adopted the data layout proposed in [17, 18, 38] for input images, kernels and output, where 16 images are interleaved in memory for easy vectorization. In [18], 16 was used due to the size of the AVX512 vector register; we keep it at 16 regardless of the vector register size, as 16 is the cache-line width (16 32-bit floats), to facilitate efficient utilization of the memory subsystem.

For data hand-off between the four stages of the algorithm, streaming stores to main memory were used, since the data will not be used in the near future. This saves memory bandwidth and avoids cache pollution.

Overall, all three implementations achieved high utilization of the available hardware, as discussed in the following sections.

Transforms. To perform transforms of the input images and kernels, as well as the output images, the implementation of [18] provides C++ codelets that perform transforms of 16 tiles at a time. The codelets are created by generating Winograd transforms using Wincnn [7], after which a computation graph is created and optimized, yielding codelets that utilize AVX512 instructions to transform 16 tiles at a time.

For the FFT-based implementations, the codelets were replaced by C++ codelets generated using "genfft", supplied with FFTW [13]. "Genfft" was modified so that it can generate codelets that perform implicitly zero-padded FFT transforms, as well as codelets that compute only a subset of the elements of the inverse transform. Multidimensional (forward) transforms were implemented by combining codelets performing implicitly zero-padded real-to-complex transforms along one dimension with codelets performing complex-to-complex transforms along the other dimensions. Backward transforms combined complex-to-complex transform codelets and complex-to-real ones. Different from existing FFT-based convolutions, which limit the transform size to small prime factors [37] or only to numbers that are powers of two [9, 24], our implementations support arbitrary sizes.


Table 1. Machine configurations used for benchmarking. MB represents the theoretical peak memory bandwidth, and CMR represents the ratio between the available FLOPS and the memory bandwidth.

CPU                  Cores  GFLOPS  AVX     Cache   MB (GB/s)  CMR
Xeon Phi 7210        64     4506    AVX512  512 KB  409.6      11
i7-6950X             10     960     AVX2    1 MB    68.3       14.06
i9-7900X             10     2122    AVX512  1 MB    96         22
Xeon Gold 6148       20     3072    AVX512  1 MB    128        24
E7-8890v3            18     1440    AVX2    256 KB  51.2       28.13
Xeon Platinum 8124M  18     3456    AVX512  1 MB    115.2      30
i9-7900X             10     2122    AVX512  1 MB    68.3       31
Xeon Phi 7210        48     4506    AVX512  512 KB  102.4      33
Xeon Phi 7210        64     4506    AVX512  512 KB  102.4      39.11
i9-7900X             10     2122    AVX512  1 MB    51.2       41.25

Element-Wise Multiplications. For the element-wise stage, where matrix-matrix multiplications are performed, the implementation of [18] provides JIT routines for real-matrix multiplications optimized for the AVX512 instruction set. Following the same principles as [18], we implemented JIT real-matrix multiplication routines optimized for the AVX2 instruction set, as well as complex-number multiplication routines for both the AVX512 and AVX2 instruction sets, which are required for the Regular-FFT method.

Parallelization Through Static Scheduling. Each of the stages of our algorithm is parallelized using static scheduling, originally proposed in [38], using the generalized implementation provided by [18]. To achieve optimal performance, each core is assigned roughly the same amount of computation. The work is then executed using a single fork-join routine.

4 Performance Comparisons

To compare the running times of the FFT and Winograd methods, we benchmarked our FFT-based implementations and the improved Winograd implementation of Jia et al. [18] using the two most popular ConvNets, AlexNet [20] and VGG [28], which are frequently used for benchmarking [4]. We benchmarked the time required for the forward propagation of each distinct layer of the two networks.

Additional publicly available implementations were included for reference: the Winograd implementations provided by LIBXSMM [6] and MKL-DNN [2], and the direct convolution provided by MKL-DNN [2]. To the best of our knowledge, no other implementation of FFT-based methods for CPUs besides our own is currently available.

Both the Winograd and FFT methods can work with arbitrary transform sizes. Generally, larger transform sizes decrease the number of required operations. However, the numerical inaccuracy of Winograd convolutions increases exponentially with the transform (tile) size [18, 21, 26]. In practice, there is an upper bound on the transform size for which Winograd produces satisfactory results. All major vendors, such as FALCON, MKL-DNN, LIBXSMM and cuDNN [1, 2, 6, 9], implement the Winograd algorithm with transform sizes less than or equal to 6 × 6.2 For these tile sizes, the numerical error of the computation is comparable to that of direct convolution.

We follow the same convention and consider Winograd convolutions only with tile sizes of at most 6 × 6. However, we allow the FFT methods to have arbitrarily large tile sizes, as they don't suffer from such numerical instability.

2 With transform sizes of 6 × 6, the average numerical error of the Winograd method on the benchmarked layers was 7.03 · 10⁻⁶, which is similar to the error of direct convolution, 1.11 · 10⁻⁶. When the transform size is increased to 8 × 8, the error increases to 1.24 · 10⁻³, which was expected [18]. The numerical error of the FFT-based method was no larger than 2.88 · 10⁻⁷, regardless of the tile size.

Running Time Experiments. The benchmarks were performed on a total of 10 systems, shown in Tbl. 1. In Fig. 1 we show detailed results on one of the systems. Note that both LIBXSMM's and MKL-DNN's Winograd implementations support only kernel sizes of 3 × 3, and thus cannot use the Winograd method for the second layer of AlexNet.

The Winograd-based method outperformed in only 3 out of 12 layers, whereas an FFT-based method outperformed on 6 layers, and on 3 layers they had roughly the same performance. More importantly, in the cases where the FFT methods outperformed, they did so with a larger margin, and on the layers that require more computation time. This suggests that the FFT methods can allow for significant savings in the overall computation time when compared to Winograd. For instance, the time spent on all convolutional layers in AlexNet using the Winograd method would consume 58.79 milliseconds, whereas the Regular-FFT method requires only 31.96 milliseconds; that is a speedup of 1.84x.

Additionally, in Fig. 2 we show the normalized running times on all other AVX512 systems (neither LIBXSMM nor MKL-DNN supports the AVX2 instruction set). The results for each individual layer are scaled based on the slowest implementation, as we are only interested in the relative performances. Detailed results are available in the Appendix. In all cases, our two FFT-based implementations, as well as the modified Winograd implementation of [18], outperformed the other publicly available implementations. Thus, for the rest of the paper we focus only on these three implementations.

FFT transform sizes. An important observation was that the optimal transform sizes for the FFT method were not always powers of two: they were 27 for VGG 1.2; 25 for VGG 2.1 and VGG 2.2; 21 for VGG 3.1 and VGG 3.2; 16 for VGG 4.1 and VGG 4.2; and 9 for VGG 5.1/5.2. For AlexNet, the optimal sizes were 31 for the second layer and 15 for all other layers. This is counter-intuitive, as the common belief is that optimal FFT transforms should use sizes that contain only small prime factors [13, 37], or sizes equal to powers of two, which is suggested by the two GPU FFT-based implementations, i.e. fbfft [24, 31] and cuDNN [11].



Figure 1. Convolution layers' running time on the Xeon Gold CPU. [Bar charts of running time in milliseconds for AlexNet layers 2-5 and VGG layers 1.2-5.1/5.2, comparing Regular-FFT, Gauss-FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct and LIBXSMM Winograd; "n/a" where an implementation does not support the layer.]

Figure 2. Normalized running time of different convolution implementations. (The systems with the same CPUs are distinguished by their CMR.) [Per-layer normalized running times of Regular-FFT, Gauss-FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct and LIBXSMM Winograd on the AVX512 systems: Xeon Phi 7210 (CMR 11, 33, 39.11) and i9-7900X (CMR 22, 31, 41.25).]

Relative performances. While an FFT-based method outperformed the Winograd one more often than not, the relative performances varied among different layers and systems. In the rest of the paper we focus on the theoretical analysis of all the methods. We would like to understand why our findings are not aligned with the popular belief that the Winograd method should be strictly faster.

5 Performance Analysis

Our experimental observations suggested that some of the stages in both the Winograd- and FFT-based approaches have relatively low utilization of the system's available FLOPS. In most, but not all, cases these were the transform stages, which perform a relatively small amount of compute but access a relatively large amount of data, suggesting that their running time is bound by the memory bandwidth and not by the computational capabilities of the CPU.

For this reason, we used the Roofline [33] performance model to analyze the Winograd- and FFT-based convolutions.

Roofline Performance Model is an ISA-oblivious, throughput-oriented performance model used to estimate the performance of an application running on processing units; it is often used to predict performance on CPUs, GPUs, TPUs (Tensor Processing Units), etc.

It is suitable for analyzing particular methods on systems where the memory bandwidth often becomes the constraining resource. It estimates the performance bound of an algorithm as a function of its arithmetic intensity (AI), which is defined as the ratio of the total floating-point operations (FPO) and the total data movement (DM) in bytes (AI = FPO/DM) between different levels of the memory hierarchy. For each level of the memory hierarchy, the performance ceiling (attainable FLOPS) is determined by

Attainable FLOPS = min(Peak FLOPS, MB × AI)      (7)

where Peak FLOPS is defined as the maximal number of floating-point operations per second of a given system, and MB as the system's peak memory bandwidth. When plotted, the performance ceiling line resembles a roofline.

Here we are interested in the DM between the highest level of on-chip core-exclusive cache (typically L2 for CPUs) and off-chip shared memory.3 In systems where the L2 is shared among a small number of cores (e.g. the Intel Xeon Phi series shares the L2 cache between two cores), the L2 is assumed to be equally divided for exclusive usage among the cores.

DM then accounts for all regular and streaming stores to main memory, regular loads from main memory, and prefetches from main memory to either L1 or L2. Our analysis does not consider the presence of higher-level shared caches, as on modern processors the sizes of such caches are very limited and cannot fit the working set of typical convolutional layers.

3 The term off-chip refers to the main memory of the system, typically large but with much lower throughput and higher latency than the on-chip caches. The HBW MCDRAM of the Knights Landing processor would be considered off-chip.

Additional performance ceilings for lower levels of cache are not necessary, since in both the Winograd-based and FFT-based convolutional algorithms the computation is either bounded by transfers to and from main memory or by the processor's peak FLOPS, as shown in the following sections.

As the performance ceiling is set by the algorithm's arithmetic intensity (Eqn. 7), its running time can be estimated by

running time = FPO / Attainable FLOPS = FPO / min(Peak FLOPS, MB × AI) = (FPO/MB) / min(CMR, AI)

             = FPO / Peak FLOPS,   if CMR ≤ AI
             = DM / MB,            if CMR > AI      (8)

Here CMR is the system's compute-to-memory ratio, defined as the ratio of its Peak FLOPS and MB, the memory bandwidth; FPO represents the total number of floating-point operations required, and DM the total amount of data movement in bytes, as defined above. The running time is compute bound when CMR ≤ AI, in which case it can be computed as FPO / Peak FLOPS; otherwise the running time is memory bound and can be computed as DM / MB.

Estimating running time. For the Winograd- and FFT-based convolutional layers, where the computation is composed of several sequential stages (S), each with a unique arithmetic intensity (AI), the running time is calculated by accumulating the running times of each stage s ∈ S:

running time_Total = Σ_{s∈S} running time_s = Σ_{s∈S} (FPO_s / MB) / min(CMR, AI_s)      (9)
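The model of Eqns. 7-9 is simple enough to evaluate directly. The sketch below (hypothetical per-stage numbers; only the formulas come from the text) accumulates per-stage roofline estimates and forms the speedup ratio used later in Eqn. 10:

```python
def stage_time(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline running-time estimate of a single stage (Eq. 8)."""
    ai = flops / bytes_moved                    # arithmetic intensity
    cmr = peak_flops / mem_bw                   # compute-to-memory ratio
    return flops / peak_flops if cmr <= ai else bytes_moved / mem_bw

def total_time(stages, peak_flops, mem_bw):
    """Sum of per-stage estimates (Eq. 9); stages = [(FPO_s, DM_s), ...]."""
    return sum(stage_time(f, d, peak_flops, mem_bw) for f, d in stages)

# Illustrative numbers only: (FLOPs, bytes moved) for the four stages of one layer.
winograd = [(1e9, 4e8), (2e8, 1e8), (8e9, 2e9), (1e9, 4e8)]
fft      = [(2e9, 6e8), (4e8, 2e8), (9e9, 1.5e9), (2e9, 6e8)]

peak, bw = 2122e9, 96e9                         # e.g. the i9-7900X entry of Tbl. 1
speedup = total_time(winograd, peak, bw) / total_time(fft, peak, bw)
print(f"estimated FFT vs. Winograd speedup: {speedup:.2f}x")
```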

5.1 Relative Performances

We are interested in the relative performance between the Winograd and FFT methods. We define Speedup(A, B) as the ratio of the running times of an algorithm B and an algorithm A:

Speedup(A, B) = running time_Total^B / running time_Total^A      (10)

A speedup greater than one indicates that algorithm A is faster, and a speedup smaller than one means that algorithm B is faster.

Eqn. 9 estimates the running time of an algorithm assuming perfect utilization of the hardware, which is rarely possible in practice. However, Eqn. 10 remains valid when the compared algorithms are equally optimized, meaning that they utilize the same percentage of the hardware capabilities.

In Eqn. 10, the value of AI will differ between the Winograd- and FFT-based approaches, and will also depend on the amount of available cache [18]. Therefore, when comparing the performance of two algorithms, the relative performance on a given system will depend on the CMR and the amount of available cache, but not on the absolute values of the system's compute speed or memory throughput.

A detailed analysis of obtaining the values of AI, DM and FPO is presented in Appendix A.

Using Eqn. 10 and the values of AI, DM and FPO calculated in Appendix A (Tbl. 2), we estimate the speedup of the FFT-based methods over the Winograd-based one. For all three methods, the tile sizes are chosen in a way that minimizes the total running time (Eqn. 9).

5.2 The Accuracy of the Theoretical Estimates

In Fig. 3 we plot the estimated theoretical speedup of the two FFT-based methods over the Winograd-based one. The solid lines represent the theoretical estimates of the relative performances as a function of the system's CMR. The color of the line represents the amount of available L2 cache. The lines are drawn in a semi-translucent fashion, as they overlap in many places.

The empirical results from the benchmarks described in Section 4 are overlaid; each cross-hair represents the measurement of the relative performances on a single system. The x coordinate of the cross-hairs represents the system's compute-to-memory ratio (CMR, see Tbl. 1), and the color represents the L2 cache size.

Our empirical results are consistent with our theoretical estimates. The overall relative root mean square error (rRMSE) was 0.079 for Regular-FFT vs. Winograd and 0.1 for Gauss-FFT vs. Winograd. The fitness4 was 92.68 for Regular-FFT vs. Winograd and 90 for Gauss-FFT vs. Winograd.

4 fitness = 100/(1 + rRMSE)

5.3 Discussions

Hardware utilization. We have also measured the system utilization of each stage of the three methods. While the utilization varied across benchmarked layers and systems, on average 75% of the theoretical peak FLOPS was attained during the compute-bound stages, and slightly more than 85% of the theoretical memory bandwidth was achieved in the memory-bound stages. This results in a somewhat lower effective CMR, which is consistent with the results shown in Fig. 3: the empirical results are slightly shifted to the left (towards lower values of CMR) when compared to the theoretical predictions, which assume equal utilization of both the FLOPS and the memory bandwidth.

Optimality of Tile Transforms. Both "wincnn" [7] and "genfft" [13] use heuristics to minimize the number of operations for performing transforms, and might not be optimal. However, as their AIs are much smaller than the CMRs of modern systems, the computation is memory bound, and the running time of the transform stages depends solely on the amount of data movement. The largest AI of the FFT transforms is 5.55, and for Winograd 2.38, much lower than the CMRs of modern CPUs. The Xeon Phi Knights Landing processor has a CMR of 11 (due to its fast on-chip MCDRAM),



and the Xeon Server processor family has CMRs in the range of 20-40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were provided.

This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. Using such sizes, the images could be divided into overlapping tiles with minimal overhead (minimal padding of the original image), which reduces both the number of required operations and the amount of required data movement.

Figure 3. Theoretical estimates and empirical measurements for the speedup of the Regular- and Gauss-FFT methods over the Winograd method, on VGG and AlexNet. The x axis represents the system's CMR. [One panel per layer (AlexNet 2-5, VGG 1.2-5.1/5.2), each plotting speedup against compute-to-memory ratio (FLOPs/Byte), with empirical points and theoretical curves for L2 cache sizes of 256 KB, 512 KB and 1024 KB.]

6 Conclusions

This paper presents experimental and analytical findings that FFT-based convolutions are, on average, faster than Winograd-based convolutions on modern CPUs.

In contrast to the popular belief that the Winograd-based method is more efficient, our experimental results of a highly optimized Winograd-based implementation and two similarly optimized FFT-based implementations on modern CPUs show that the FFT convolutions are, more often than not, faster than Winograd ones for the most commonly used ConvNet layers.

Our analysis using the Roofline model shows that whether the Winograd- or FFT-based approach is faster depends on both the layer and the target hardware. However, with the tendency of increasing compute-to-memory ratios of systems with modern CPUs, the FFT-based methods tend to be faster than Winograd.

While we primarily focused on modern multi- and many-core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can be applied to GPUs. Future work might include implementing and benchmarking efficient GPU-based implementations and validating the performance analysis based on the Roofline model.

References

[1] 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library (2016).
[2] 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn (2016).
[3] 2018. N-Dimensional Winograd-based convolution framework. https://bitbucket.org/poozh/ond-winograd (2018).
[4] Accessed 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks.
[5] Accessed 08-15-2018. Intel Nervana reference deep learning framework. https://github.com/NervanaSystems/neon.
[6] Accessed 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm.
[7] Accessed 08-15-2018. Wincnn. https://github.com/andravin/wincnn.
[8] Accessed 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master.
[9] Jeremy Appleyard. Accessed 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5.
[10] David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[13] Matteo Frigo and Steven G. Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, Vol. 3. IEEE, 1381–1384.
[14] Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 229–254. http://dl.acm.org/citation.cfm?id=647970.761024
[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 981–991.
[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[19] Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455–500.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
[23] M. Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306–308.
[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).
[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2924–2932.
[26] Victor Y. Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676–694.
[27] Lawrence R. Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The Roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24–26.
[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. SIAM.
[35] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20–24.
[36] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium, 2016 IEEE International. IEEE, 801–811.
[37] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 854–865.
[38] Aleksandar Zlateski and H. Sebastian Seung. 2017. Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity, we assume 2D and isotropic images and kernels, as well as computation performed with 32–bit floating point numbers (4 bytes per number). Extensions to non–isotropic and N–dimensional images and kernels, as well as to lower–precision floats, are straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non–isotropic kernels.

We follow the same notation as in Section 2, and for an arbitrary layer use Winograd convolution F(m², r²), Regular–FFT convolution F(m², r²), and Gauss–FFT convolution G(m², r²). Here m can take an arbitrary value, whereas r² has to equal the size of the kernels in the layer.

We proceed to present the details of all three methods, and estimate the total number of operations and the data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A.1 Input Transform Stage

In Winograd and both FFT approaches, each of the B·C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has size t², with t = (m + r − 1), and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions. Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image is implicitly zero–padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.

Number of required operations. Each tile I_{b,c,n} is transformed via Ĩ_{b,c,n} = B I_{b,c,n} Bᵀ, which is a composition of two matrix–matrix multiplications. Here I_{b,c,n} is the n–th tile of the c–th channel in batch b, and Ĩ_{b,c,n} is its transform. Both B and I_{b,c,n} have size t², requiring a total of 4t³ operations (2t³ per matrix–matrix multiplication). Operations are on real numbers for the Winograd approach and on complex numbers for the FFT approaches.

The transforms can, however, be performed with much fewer operations. In the Winograd method, the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real–to–complex DFT transforms are performed, which require much fewer operations (μn² log n instead of O(n³)). Here the constant μ can vary significantly based on the size of the transform performed [13]. It takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.

Instead of using a closed–form expression that gives bounds on the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as FI(m², r²) for Winograd, FI(m², r²) for Regular–FFT, and GI(m², r²) for Gauss–FFT. The pre–computed lookup tables are included in our open–sourced repository.

The total number of operations required for performing all the transforms, for all three methods, is given in Tbl. 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in–cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx²·4 bytes: each of the B·C images of size x² is read only once. Each of the BCN transformed tiles is then stored back to main memory. The size of each transformed tile depends on the method.

In Winograd convolution, transformed tiles are real tensors and require a single 32–bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms are complex conjugate–symmetric (along one of the dimensions), and only t⌈(t + 1)/2⌉ complex numbers need to be stored. The Regular–FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and the Gauss–FFT requires three reals per complex number, for a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.
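As a rough illustration of how these expressions are evaluated, the sketch below computes the input–transform data movement and AI of Table 2 for a hypothetical layer; the per–tile transform FLOP counts are placeholders standing in for the lookup–table values (Tables 3–8), and all layer parameters are assumptions rather than benchmarked configurations.

```cpp
// Minimal sketch of the input-transform stage model (Table 2) for an assumed layer.
#include <cmath>
#include <cstdio>

int main() {
    // Illustrative layer (assumptions, not a benchmarked configuration).
    const double B = 64, C = 64;        // batch size and input channels
    const double x = 58, r = 3, m = 14; // image, kernel, and output-tile sizes
    const double t = m + r - 1;                                   // transform tile size
    const double N = std::pow(std::ceil((x - r + 1) / m), 2.0);   // tiles per image
    const double cx = t * std::ceil((t + 1) / 2);                 // t*ceil((t+1)/2) complex elements kept

    // Placeholder per-tile transform FLOP counts; the model reads the real values
    // from the lookup tables (Table 3 for Winograd, Tables 5/7 for the FFT methods).
    const double FI_wino = 1000, FI_rfft = 1200, GI_gfft = 1300;

    // Data movement in bytes (Table 2): read every image once, store every transformed tile.
    const double DM_wino = 4*B*C*x*x + 4*B*C*N*t*t;
    const double DM_rfft = 4*B*C*x*x + 8*B*C*N*cx;
    const double DM_gfft = 4*B*C*x*x + 12*B*C*N*cx;

    std::printf("input-transform AI  Winograd: %.2f  Regular-FFT: %.2f  Gauss-FFT: %.2f\n",
                B*C*N*FI_wino / DM_wino, B*C*N*FI_rfft / DM_rfft, B*C*N*GI_gfft / DM_gfft);
    return 0;
}
```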

A.2 Kernel Transform Stage

In all methods, each of the C×C′ kernels (of size r×r) is transformed via W̃_{c,c′} = G W_{c,c′} Gᵀ. The matrix G has size t×r. Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero–padded to the size of the input tiles. In practice, the transforms are computed with implicit zero–padding. As in the input stage, we have pre–computed the number of operations required for transforming a kernel in all three methods: FK(m², r²) for Winograd and FK(m², r²) for Regular–FFT. For the Gauss–FFT method, GK(m², r²) = FK(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.


Table 2. FLOPS, DM, and AI for the different stages of the Winograd (F(m², r²)), Regular–FFT (F(m², r²)), and Gauss–FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

Stage | Winograd | Regular–FFT | Gauss–FFT

FLOPS
Input image transform | BCN FI(m², r²) | BCN FI(m², r²) | BCN GI(m², r²)
Kernel transform | CC′ FK(m², r²) | CC′ FK(m², r²) | CC′ GK(m², r²)
Element–wise product | 2t²BNCC′ | 8t⌈(t+1)/2⌉BNCC′ | 6t⌈(t+1)/2⌉BNCC′
Output transform | BC′N FO(m², r²) | BC′N FO(m², r²) | BC′N GO(m², r²)

DM (bytes)
Input image transform | 4BCx² + 4BCNt² | 4BCx² + 8BCNt⌈(t+1)/2⌉ | 4BCx² + 12BCNt⌈(t+1)/2⌉
Kernel transform | 4CC′(r² + t²) | 4CC′(r² + 2t⌈(t+1)/2⌉) | 4CC′(r² + 3t⌈(t+1)/2⌉)
Element–wise product | 4t²BN(c + αc′)CC′/(cc′) | 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′) | 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
Output transform | 4BC′N(t² + m²) | 4BC′N(2t⌈(t+1)/2⌉ + m²) | 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
Input image transform | N FI(m², r²) / (4x² + 4Nt²) | N FI(m², r²) / (4x² + 8Nt⌈(t+1)/2⌉) | N GI(m², r²) / (4x² + 12Nt⌈(t+1)/2⌉)
Kernel transform | FK(m², r²) / (4r² + 4t²) | FK(m², r²) / (4r² + 8t⌈(t+1)/2⌉) | GK(m², r²) / (4r² + 12t⌈(t+1)/2⌉)
Element–wise product | cc′ / (2(c + αc′)) | cc′ / (c + αc′) | cc′ / (2(c + αc′))
Output transform | FO(m², r²) / (4t² + 4m²) | FO(m², r²) / (8t⌈(t+1)/2⌉ + 4m²) | GO(m², r²) / (12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory (a total of 4CC′r² bytes read) and store the CC′ transformed kernels back to main memory. The transformed kernels have size t×t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉, and 12t⌈(t + 1)/2⌉ bytes per kernel for the Winograd, Regular–FFT, and Gauss–FFT methods, respectively.

The total number of operations, data movements, and AIs for transforming all kernels of a particular layer, using any of the three methods, are given in Tbl. 2.

A.3 Element–wise Products

Having all transformed input and kernel tiles, the pre–transformed output tiles Ĩ′_{b,c′,n} are computed via

    Ĩ′_{b,c′,n} = Σ_{c=1}^{C} Ĩ_{b,c,n} ⊙ W̃_{c,c′}    (11)

Here, all of the pre–transformed output tiles Ĩ′_{b,c′,n}, transformed input tiles Ĩ_{b,c,n}, and transformed kernels W̃_{c,c′} have the same size t×t. Computing an element of the pre–transformed output tile at a location e depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

    Ĩ′_{b,c′,n}(e) = Σ_{c=1}^{C} Ĩ_{b,c,n}(e) · W̃_{c,c′}(e)    (12)

Note that the equation above can be interpreted as multiplication of a matrix U_e of size BN×C with a matrix V_e of size C×C′, resulting in a matrix X_e of size BN×C′. For layers of modern ConvNets, the values of BN are much larger than C, which results in multiplication of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location of Ĩ′_{b,c′,n}.
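A naive reference version of this computation pattern is sketched below. It is only meant to show the per–location tall–and–skinny matrix multiplications of Eqn. 12, not the cache–blocked, JIT–generated kernels the benchmarked implementations actually use; real–valued tensors are assumed (the Winograd case), whereas the FFT methods operate on complex values.

```cpp
// Naive reference sketch of the element-wise stage: for every element location e of
// the transformed tiles, multiply a (BN x C) matrix of input transforms by a (C x C')
// matrix of kernel transforms, producing the (BN x C') pre-transformed outputs.
#include <vector>

void elementwise_stage(int BN, int C, int Cp, int E,
                       const std::vector<float>& U,   // [E][BN][C]   transformed input tiles
                       const std::vector<float>& V,   // [E][C][Cp]   transformed kernels
                       std::vector<float>& X)         // [E][BN][Cp]  pre-transformed output tiles
{
    for (int e = 0; e < E; ++e)                 // E = t*t (Winograd) or t*ceil((t+1)/2) (FFT)
        for (int i = 0; i < BN; ++i)
            for (int j = 0; j < Cp; ++j) {
                float acc = 0.f;
                for (int c = 0; c < C; ++c)
                    acc += U[(e * (long)BN + i) * C + c] * V[(e * (long)C + c) * Cp + j];
                X[(e * (long)BN + i) * Cp + j] = acc;
            }
}
```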

A.3.1 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT–based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre–transformed output tiles need to be computed: as both the transformed input tiles and transformed kernels are conjugate symmetric, the pre–transformed output tiles will also be conjugate symmetric.

In the Regular–FFT method, 4 real multiply–adds are required for performing one complex multiply–add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs for the whole layer. Gauss–FFT reduces each complex matrix multiplication to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs for the whole layer.

A.3.2 Data Movement

In the Winograd method, t² real matrix multiplications X = U×V are performed (for the rest of the section, we omit the subscript e for clarity). In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications of the same sizes are performed. The Gauss–FFT method replaces each complex matrix multiplication with three real matrix multiplications, thus a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U, and V may not fit in cache, in which case standard cache–blocking approaches are employed [14–16, 18, 22], and some data might have to be moved to and/or from the main memory multiple times.

In the standard cache–blocking technique, V (of size C×C′) is subdivided into equally sized sub–matrices of size c×c′. To minimize transfers to and from main memory, a sub–matrix of V is kept in cache. A small sub–matrix of U, of size ρ×c, is fetched from memory, multiplied with the sub–matrix of V stored in–cache, and accumulated into the appropriate sub–matrix of X. Here, ρ is a small number required for efficient in–cache computation [14–16, 18, 22].

This requires transferring a total of ρc numbers from main memory, and moving ρc′ numbers of X from, and then back to, main memory: a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub–matrix multiplication produces the final result (no accumulation is necessary).

A total of BNCC′/(ρcc′) such sub–matrix multiplications is performed, requiring BNCC′(c + αc′)/(cc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications (X = U×V) are performed, thus transferring a total of 4t²BN(c + αc′)CC′/(cc′) bytes.

In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

The Gauss–FFT method replaces each of the t⌈(t + 1)/2⌉ complex matrix multiplications with 3 real matrix multiplications; a total of 12t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes needs to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and c′, while allowing for in–cache computation.

As the values of t, B, C, C′, and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

    minimize_{c,c′}   (c + αc′) / (cc′)
    subject to   c | C      (C is divisible by c)
                 c′ | C′     (C′ is divisible by c′)
                 4βcc′ ≤ (cache size)/2   (fits in half the cache)        (13)

Here α equals 1 when c = C and 2 when c < C; β is 1 for real–valued matrices and 2 for complex–valued ones. Half the cache is allotted to the sub–matrices of V; this is typical practice, to ensure enough space for the sub–matrices of U and X.

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement, for the optimal values of c and c′. This results in AIs of cc′/(2(c + αc′)) for the Winograd and Gauss–FFT methods, and cc′/(c + αc′) for the Regular–FFT method, which are equal to the AIs of real–matrix and complex–matrix multiplications, respectively [14–16, 18].
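One way to pick c and c′ in practice is simple enumeration, since both must divide the channel counts. The sketch below does exactly that under an assumed per–core cache size and then reports the resulting element–wise AI; it is a minimal illustration of Eqn. 13, not the tuning procedure used by the benchmarked implementations.

```cpp
// Minimal sketch: choose cache-blocking parameters c and c' (Eqn. 13) by enumerating
// divisors of C and C', keeping the pair that minimizes (c + alpha*c')/(c*c') while a
// c x c' sub-matrix of V fits in half of the (assumed) per-core cache.
#include <cstdio>

int main() {
    const int    C = 512, Cp = 512;          // illustrative layer channels
    const double cache_bytes = 1024 * 1024;  // assumed per-core cache (1 MB)
    const double beta = 2;                   // 2 for complex matrices (Regular-FFT), 1 for real

    double best = 1e30; int best_c = 1, best_cp = 1;
    for (int c = 1; c <= C; ++c) {
        if (C % c) continue;                                     // c must divide C
        for (int cp = 1; cp <= Cp; ++cp) {
            if (Cp % cp) continue;                               // c' must divide C'
            if (4.0 * beta * c * cp > cache_bytes / 2) continue; // half-cache constraint
            double alpha = (c == C) ? 1.0 : 2.0;                 // no re-accumulation when c == C
            double cost = (c + alpha * cp) / (double)(c * cp);   // data moved per operation
            if (cost < best) { best = cost; best_c = c; best_cp = cp; }
        }
    }
    double alpha = (best_c == C) ? 1.0 : 2.0;
    // AI of the element-wise stage: c*c'/(c + alpha*c') for complex matrices (Regular-FFT);
    // the Winograd and Gauss-FFT methods achieve half of this value.
    std::printf("c=%d  c'=%d  AI(complex) = %.1f\n", best_c, best_cp,
                best_c * best_cp / (best_c + alpha * best_cp));
    return 0;
}
```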

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels, as a function of the cache size.

Figure 4. Arithmetic intensities for the element–wise stage given various amounts of cache (64 KB to 1024 KB) and numbers of channels C and C′. In the Winograd and Gauss–FFT methods real matrix multiplications are performed, whereas in the Regular–FFT method complex matrix multiplications are performed.

The AIs of both complex matrix multiplication (used in the Regular–FFT method) and real matrix multiplication (used in Winograd and Gauss–FFT) increase with the cache size and with the number of channels (C and C′). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element–wise stage of the Regular–FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss–FFT convolutions when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the BNC′ pre–transformed output tiles Ĩ′_{b,c′,n} is transformed from the Winograd/FFT domain back to the spatial domain via I′_{b,c′,n} = Aᵀ Ĩ′_{b,c′,n} A. The matrix A has size t×m (t = m + r − 1), resulting in I′_{b,c′,n} having a size of m×m. Here I′_{b,c′,n} is the n–th (non–overlapping) tile of the c′–th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m×m elements needs to be computed, and 2) different (inverse transform) matrices, or inverse FFT transforms, are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements, and AIs for transforming the output tiles of a particular layer, using any of the three methods, are given in Tbl. 2.
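Putting the pieces together, a Roofline–style estimate of a layer's running time can be obtained by bounding each stage of Table 2 by both its compute time and its memory time and summing the four stages. The sketch below does this for the Winograd column of Table 2; the system parameters, layer sizes, per–tile FLOP counts, and blocking factors are all illustrative assumptions, not values from our benchmarks.

```cpp
// Minimal sketch: combine the per-stage FLOPS and DM expressions of Table 2 into a
// Roofline-style running-time estimate for one (hypothetical) layer and system.
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Stage { double flops, dm; };   // floating point operations and bytes moved

double stage_time(const Stage& s, double peak_flops, double mem_bw) {
    // A stage can run no faster than either its compute bound or its memory bound.
    return std::max(s.flops / peak_flops, s.dm / mem_bw);
}

int main() {
    // Illustrative layer (Winograd column of Table 2) and blocking from Eqn. 13.
    const double B = 64, C = 256, Cp = 256, x = 56, r = 3, m = 6;
    const double t = m + r - 1, N = std::pow(std::ceil((x - r + 1) / m), 2.0);
    const double fi = 800, fk = 250, fo = 320;   // placeholder per-tile FLOPs (see Tables 3-8)
    const double c = 64, cp = 64, alpha = 2;     // assumed blocking parameters (c < C)

    Stage input  {B*C*N*fi,        4*B*C*x*x + 4*B*C*N*t*t};
    Stage kernel {C*Cp*fk,         4*C*Cp*(r*r + t*t)};
    Stage elem   {2*t*t*B*N*C*Cp,  4*t*t*B*N*(c + alpha*cp)*C*Cp/(c*cp)};
    Stage output {B*Cp*N*fo,       4*B*Cp*N*(t*t + m*m)};

    const double peak_flops = 2.0e12, mem_bw = 96e9;   // hypothetical CPU: 2 TFLOPS, 96 GB/s
    double total = 0;
    for (const Stage& s : {input, kernel, elem, output})
        total += stage_time(s, peak_flops, mem_bw);
    std::printf("estimated layer time: %.3f ms\n", total * 1e3);
    return 0;
}
```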

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel with the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular–FFT method. In Tbl. 7 and Tbl. 8 we show the corresponding lookup tables for the Gauss–FFT method.

Table 3. FLOPS required for performing Winograd transforms of a single tile or kernel. For each kernel size r, the columns give the input (In), kernel (Ker), and output (Out) transform FLOPS.

Method    | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out
F(2², r²) | 12 5 5          | 32 49 42        | 180 198 154     | 240 231 168     | 742 728 504     | 704 645 430
F(3², r²) | 32 24 28        | 180 128 128     | 240 170 153     | 742 552 460     | 704 518 407     | - - -
F(4², r²) | 180 70 90       | 240 135 150     | 742 396 396     | 704 403 372     | - - -           | - - -
F(5², r²) | 240 72 99       | 742 260 312     | 704 300 325     | - - -           | - - -           | - - -
F(6², r²) | 742 144 208     | 704 242 308     | - - -           | - - -           | - - -           | - - -
F(7², r²) | 704 130 195     | - - -           | - - -           | - - -           | - - -           | - - -

Table 4. AIs of Winograd transforms for a single tile or kernel. For each kernel size r, the columns give the input (In), kernel (Ker), and output (Out) transform AIs.

Method    | r=2: In Ker Out  | r=3: In Ker Out  | r=4: In Ker Out  | r=5: In Ker Out  | r=6: In Ker Out  | r=7: In Ker Out
F(2², r²) | 0.17 0.10 0.10   | 0.25 0.49 0.53   | 0.90 1.21 1.33   | 0.83 0.95 1.05   | 1.89 2.14 2.38   | 1.38 1.43 1.58
F(3², r²) | 0.25 0.30 0.28   | 0.90 0.94 0.94   | 0.83 0.82 0.85   | 1.89 1.86 1.98   | 1.38 1.29 1.39   | - - -
F(4², r²) | 0.90 0.60 0.55   | 0.83 0.75 0.72   | 1.89 1.52 1.52   | 1.38 1.13 1.16   | - - -            | - - -
F(5², r²) | 0.83 0.45 0.41   | 1.89 1.12 1.05   | 1.38 0.94 0.91   | - - -            | - - -            | - - -
F(6², r²) | 1.89 0.68 0.61   | 1.38 0.83 0.77   | - - -            | - - -            | - - -            | - - -
F(7², r²) | 1.38 0.48 0.43   | - - -            | - - -            | - - -            | - - -            | - - -

Table 5. FLOPS required for performing Regular–FFT transforms of a single tile or kernel. For each kernel size r = 2, ..., 7, the three columns of each method row give the input (In), kernel (Ker), and output (Out) transform FLOPS.

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -


Table 6. AIs of Regular–FFT transforms for a single tile or kernel. For each kernel size r = 2, ..., 7, the three columns of each method row give the input (In), kernel (Ker), and output (Out) transform AIs.

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss–FFT vs. the Regular–FFT method.

D Comparing to State–of–the–art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and with other state–of–the–art libraries. The benchmarks were performed on AVX512–capable systems, as neither MKL-DNN nor LIBXSMM supports the AVX2 instruction set. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.


Table 7. FLOPS required for performing Gauss–FFT transforms of a single tile or kernel. For each kernel size r = 2, ..., 7, the three columns of each method row give the input (In), kernel (Ker), and output (Out) transform FLOPS.

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

Figure 5. Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs. the Gauss–FFT method on the unique layers of VGG and AlexNet, as a function of the system's CMR.

Table 8. AIs of Gauss–FFT transforms for a single tile or kernel. For each kernel size r = 2, ..., 7, the three columns of each method row give the input (In), kernel (Ker), and output (Out) transform AIs.

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

Figure 6. Running time of our implementations compared to state–of–the–art libraries on AVX512–capable systems. Panels show the unique layers of VGG and AlexNet on Xeon Phi 7210 (409.6 GB/s, CMR of 11), i9–7900X (96 GB/s, CMR of 22), and i9–7900X (68.3 GB/s, CMR of 31); bars compare our Winograd, Regular–FFT, and Gauss–FFT implementations with MKL–DNN Winograd, MKL–DNN direct, and LIBXSMM Winograd.

Figure 7. Running time of our implementations compared to state–of–the–art libraries on AVX512–capable systems, continued. Panels show the unique layers of VGG and AlexNet on Xeon Phi 7210 using 48 cores (102.4 GB/s, CMR of 33), Xeon Phi 7210 (102.4 GB/s, CMR of 39.11), and i9–7900X (51.2 GB/s, CMR of 41.25).

Page 3: FFT Convolutions are Faster than Winograd on Modern CPUs ... · Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One

Fn(mn rn) and Fn(mn rn) assume a particular size of I(mn + rn minus 1) and I prime (mn) along the nndashth dimension Forlarger image sizes the convolution is performed using theoverlapndashaddmethod (OLA) [27]With OLA the input imagesare divided into tiles with sizes ofmn +rn minus1 and an overlapof rn minus 1 along the nndashth dimension Considering tiles at thesame location from all the input images tiles of sizemn ofthe output images are computed using the formula aboveThe main savings in computation in both the Winograd

and FFT methods comes from the fact that both the ker-nel transforms (Wcc prime timesN

n=1 Gn) and image (tiles) transforms(Ibc timesN

n=1 Bn) can be precomputed and reused many timesThe computation is dominated by computing the dot prod-ucts ndash accumulation of elementndashwise product inside thesquare brackets in Eqn 6 Computing all the dot productsin Eqn 6 is an equivalent problem to matrix multiplicationswith real matrices for the case of Winograd and complexmatrices for the case of Regular FFT convolution

23 Gaussrsquo Multiplication of Complex Numbers

In the Regular FFT convolution the computation is domi-nated by complex matrix multiplications where each com-plex number pair multiplication requires 4 real multiplica-tions when computed directly and 3 real multiplicationswhen using Gaussrsquo multiplication algorithm [21 23]

Using Gaussrsquo multiplication algorithm the product of twocomplex numbers ur + uii and vr +vii is computed by firstcomputing tmp1 = vr (ur + ui ) tmp2 = ur (vi minus vr ) andtmp3 = ui (vr +vi ) The real part of the result is then equal totmp1minustmp3 and the imaginary part to tmp1+tmp2 Similarlyan elementndashwise product of two complex tensors U ⊙ V(U = Ur +Uii andV = Vr +Vii) can be performed using threeelementndashwise products of realndashvalued tensorsFor the RegularndashFFT convolutional layer elementndashwise

product of complex tensors representing the image (tile)transforms and kernel transforms are performed and eachtensor is reused many times (Eqn 6) After performing atransform of an image tile which yields a complex tensorU = Ur +Uii a realndashvalued tensorUr +Ui can be computedand stored alongsideUr andUi Similarly tensorsVi minusVr andVr +Vi can be computed during kernel transforms and storedalongside Vr (Vi does not have to be stored) Each elementndashwise products of complex tensors can then be replaced withthree independent elementndashwise products of realndashvaluedtensorsThe resulting three real tensors are implicitly converted

back to a single complex tensor during the computation ofinverse transform (timesN

n=1ATn )

Computing all the dot products in Eqn 6 is then performedusing three realndashvalued matrix multiplications instead of asingle complex matrix multiplication reducing the numberof required operations by 25We refer to the FFT method using Gaussrsquo multiplication

as GaussndashFFT (G(m r ))

3 ImplementationsBoth Winograd and FFT approaches perform computationsin four distinct stages input transform kernel transformelementndashwise computation and inverse transform The firsttwo stages convert the imageskernels from a spatial domaininto Winograd or FFT domain The third stage is equiva-lent to matrix multiplications The inverse transform stageconverts the results back to the spatial domain

We extended the publicly available Winograd implementa-tion from [3 18] which was highly optimized for manyndashcoreCPUs to support arbitrary AVX2 and AVX512 multindashcoreCPUs We used it as a base for our FFT implementations Wereused most of the code (90) in order to leverage the sameoptimization methodsOptimizations In order to achieve high utilization of thehardware both Winograd and FFT methods use the iden-tical optimizations as proposed in [18] including softwareprefetching memory and cache blocking using aligned vec-tor data access and interleaving memory access with com-putationWe adopted the data layout proposed in [17 18 38] for

input images kernels and output where 16 images are inter-leaved in memory for easy vectorization In [18] 16 was useddue to the size of the AVX512 vector register we keep it to 16regardless of the vector register size as 16 is the cachendashlinewidth (16 32ndashbit floats) to facilitate efficient utilization ofthe memory subsystem

For data handndashoff between the four stages of the algorithmstreaming stores to main memory were used since the datawill not be used in near future This savesmemory bandwidthand avoids cache pollutionOverall all three implementations achieved high utiliza-

tion of the available hardware as discussed in the followingsectionsTransforms To perform transforms of the input images andkernels as well as the output images the implementationof [18] provides C++ codelets that perform transforms of 16tiles at the same time The codelets are created by generatingWinograd transforms using Wincnn [7] after which a com-putation graph is created and optimized yielding codeletsthat utilize AVX512 instructions to transform 16 tiles at thetimeFor the FFTndashbased implementations the codelets were

replaced by C++ codelets generated using ldquogenfftrdquo suppliedwith FFTW [13] ldquoGenfftrdquo wasmodified so that it can generatecodelets that perform implicitly zerondashpadded FFT transformsas well as computing only a subset of elements of the inversetransform Multidimensional (forward) transforms were im-plemented by combining codelets performing implicitly zerondashpadded realndashtondashcomplex transforms along one dimensionand ones performing complexndashtondashcomplex transforms alongother dimensions Backward transforms combined complexndashtondashcomplex transform codelets and complexndashtondashreal onesDifferent from existing FFTndashbased convolutions which limittransform size to small prime factors [37] or only numbers

3

Table 1 Machine configurations used for benchmarking The MBrepresents the theoretical peak memory bandwidth and CMR rep-resents the ratio between the available FLOPS and the memorybandwidth

CPU Cores GFLOPS AVX Cache MB(GBs) CMR

Xeon Phi 7210 64 4506 512 512 KB 4096 11i7-6950X 10 960 2 1 MB 683 1406i9-7900X 10 2122 512 1 MB 96 22Xeon Gold 6148 20 3072 512 1 MB 128 24E7-8890v3 18 1440 2 256 KB 512 2813Xeon Platinum 8124M 18 3456 512 1MB 1152 30i9-7900X 10 2122 512 1 MB 683 31Xeon Phi 7210 48 4506 512 512 KB 1024 33Xeon Phi 7210 64 4506 512 512 KB 1024 3911i9-7900X 10 2122 512 1 MB 512 4125

that are powers of two [9 24] our implementations supportarbitrary sizesElementndashWise Multiplications For the elementndashwisestage where matrixndashmatrix multiplications are performedthe implementation of [18] provides JIT routines for realndashmatrix multiplications optimized for AVX512 instruction setFollowing the same principles of [18] we implemented JITrealndashmatrix multiplication routines optimized for the AVX2instruction set as well as complexndashnumber multiplicationroutines for both AVX512 and AVX2 instruction sets whichare required for the RegularndashFFT methodParallelization Through Static Scheduling Each of thestages of our algorithm is parallelized using static schedulingoriginally proposed in [38] using the generalized implemen-tation provided by [18] To achieve optimal performanceeach core is assigned roughly the same amount of compu-tation The work is then executed using a single forkndashjoinroutine

4 Performance ComparisonsTo compare running times of the FFT and Winograd meth-ods we benchmarked our FFTndashbased implementations andthe improved Winograd implementation of Jia et al [18] byusing the two most popular ConvNets AlexNet [20] andVGG [28] which are frequently used for benchmarking [4]We benchmarked the time required for the forward propaga-tion of each distinct layers of the two networksAdditional publicly available implementations were in-

cluded for reference ndash The Winograd implementations pro-vided by LIBXSMM [6] and MKL-DNN [2] and the directconvolution provided by MKL-DNN [2] To the best of ourknowledge no other implementation of FFTndashbased methodsfor CPUs beside our own is currently available

Both Winograd and FFT methods can work with arbitrarytransform sizes Generally larger transform size decrease thenumber of required operations However the numerical in-accuracy of Winograd convolutions increases exponentiallywith transform (tile) sizes [18 21 26] In practice there isan upper bound on the transform size for which the Wino-grad produces satisfactory results All major vendors suchas FALCON MKL-DNN LIBXSMM and cuDNN [1 2 6 9]

implement theWinograd algorithm with transform sizes lessthan or equal to 6times 6 2 For these tile sizes the numerical er-ror of computation is comparable to the one that implementsconvolution directlyWe follow the same convention and consider Winograd

convolutions only with tile sizes of at most 6 times 6 Howeverwe allow the FFT methods to have an arbitrary large tile sizeas they donrsquot suffer from such numerical instability

Running Time Experiments The benchmarks were per-formed on a total of 10 systems showed in Tbl 1In Fig 1 we show detailed results on one of the systems

Note that both LIBXSMMrsquos and MKL-DNNrsquos Winograd im-plementation support only kernel sizes of 3times3 and thuscan not use the Winograd method for the second layer ofAlexNet

The Winogradndashbased method outperformed in only 3 outof 12 layers whereas the a FFTndashbased method outperformedon 6 layers and on 3 layers they had roughly the same per-formances More importantly in the cases when the FFTmethods outperformed they did it with a larger marginand on the layers that require more computation time Thissuggests that the FFT methods can allow for significant sav-ings in the overall computation time when compared to theWinograd For instance the time spent on all convolutionallayers in AlexNet using the Winograd method would con-sume 5879 milliseconds whereas the ReguralndashFFT methodrequires only 3196 milliseconds that is a speedup of 184xAdditionally in Fig 2 we show the normalized running

time on all other AVX512 systems (neither LIBXSMM norMKL-DNN support the AVX2 instruction set) The resultsfor each individual layer are scaled based on the slowestimplementation as we are only interesting in the relativeperformances Detailed results are available in Appendix Inall cases our two FFTndashbased implementations as well as themodified Winograd implementation of [18] outperformedother publicly available implementations Thus for the restof the paper we will only focus on these three implementa-tions

FFT transform sizes An important observation was thatthe optimal transform sizes for the FFT method were not al-ways powers of two they were 27 for VGG12 25 for VGG21and VGG22 21 for VGG 31 and VGG32 16 for VGG41 andVGG42 and 9 for VGG5152 For AlexNet the optimal sizeswere 31 for the second layer and 15 for all other layers Thisis counter intuitive as the common belief is that optimal FFTtransforms should use sizes that only contain factors thatare small primes [13 37] or transforms with sizes equal tothe powers of two which is suggested by the two GPU FFTndashbased implementations ie fbfft [24 31] and CuDNN [11]

2With the transform sizes of 6times 6 the average numerical error of the Wino-grad method on benchmarked layers was 703 middot 10minus6 which is similar to theerror of directndashconvolution 111 middot10minus6 When the transform size is increasedto 8 times 8 the error is increased to 124 middot 10minus3 which was expected [18] Thenumerical error of the FFTndashbased method was no larger than 288 middot 10minus7regardless of the tile size

Figure 1: Convolution layers' running time on the Xeon Gold CPU (time in milliseconds per AlexNet and VGG layer; bars: Regular–FFT, Gauss–FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd).

Figure 2: Normalized running time of different convolution implementations on the remaining AVX512 systems (systems with the same CPUs are distinguished by their CMR).

Relative performances. While an FFT–based method outperformed the Winograd one more often than not, the relative performances varied among different layers and systems. In the rest of the paper we focus on a theoretical analysis of all the methods. We would like to understand why our findings are not aligned with the popular belief that the Winograd method should be strictly faster.

5 Performance Analysis

Our experimental observations suggested that some of the stages in both the Winograd– and FFT–based approaches have relatively low utilization of the system's available FLOPS. In most, but not all, cases these were the transform stages, which perform a relatively small amount of compute but access a relatively large amount of data, suggesting that their running time is bound by the memory bandwidth and not by the computational capabilities of the CPU.

For this reason, we used the Roofline [33] performance model to analyze the Winograd– and FFT–based convolutions.

Roofline Performance Model is an ISA-oblivious, throughput-oriented performance model used to estimate the performance of an application running on processing units; it is often used to predict performances on CPUs, GPUs, TPUs (Tensor Processing Units), etc.

It is suitable for analyzing particular methods on systems where the memory bandwidth often becomes the constraining resource. It estimates the performance bound of an algorithm as a function of its arithmetic intensity (AI), which is defined as the ratio of the total floating-point operations (FPO) and the total data movement (DM) in bytes (AI = FPO/DM) between different levels of the memory hierarchy. For each level of the memory hierarchy, the performance ceiling (attainable FLOPS) is determined by

$$\text{Attainable FLOPS} = \min(\text{Peak FLOPS},\; MB \times AI) \qquad (7)$$

where Peak FLOPS is defined as the maximal number of floating-point operations per second of a given system, and MB as the system's peak memory bandwidth. When plotted, the performance ceiling line resembles a roofline.

Here we are interested in the DM between the highest level of on–chip, core–exclusive cache (typically L2 for CPUs) and the off–chip shared memory.³ In systems where the L2 is shared among a small number of cores (e.g. the Intel Xeon Phi series shares the L2 cache between two cores), the L2 is assumed to be equally divided for exclusive usage among the cores.

DM then accounts for all regular and streaming stores to main memory, regular loads from main memory, and pre-fetches from main memory to either L1 or L2.

Our analysis does not consider the presence of higher-level shared caches, as on modern processors the sizes of

³ The term off–chip refers to the main memory of the system, which is typically large but has much lower throughput and higher latency than the on–chip caches. The HBW MCDRAM of the Knights Landing processor would be considered off–chip.


such caches are very limited and cannot fit the working set of typical convolutional layers.

Additional performance ceilings for lower levels of cache are not necessary, since in both Winograd–based and FFT–based convolutional algorithms the computation is either bounded by transfers to and from main memory or by the processor's peak FLOPS, as shown in the following sections.

As the performance ceiling is set by the algorithm's arithmetic intensity (Eqn. 7), its running time can be estimated by

$$\text{running time} = \frac{FPO}{\text{Attainable FLOPS}} = \frac{FPO}{\min(\text{Peak FLOPS},\, MB \times AI)} = \frac{FPO/MB}{\min(CMR,\, AI)} = \begin{cases} \dfrac{FPO}{\text{Peak FLOPS}} & CMR \le AI \\[4pt] \dfrac{DM}{MB} & CMR > AI \end{cases} \qquad (8)$$

Here CMR is the system's compute–to–memory ratio, defined as the ratio of its Peak FLOPS and MB, the memory bandwidth. FPO represents the total number of floating-point operations required, and DM the total amount of data movement in bytes, as defined above. The running time is compute bound when CMR ≤ AI, in which case it can be computed as FPO/Peak FLOPS; otherwise the running time is memory bound and can be computed as DM/MB.

Estimating running time. For the Winograd– and FFT–based convolutional layers, where the computation is composed of several sequential stages (S), each with a unique arithmetic intensity (AI), the running time is calculated by accumulating the running times of each stage s ∈ S:

$$\text{running time}_{Total} = \sum_{s}^{S} \text{running time}_s = \sum_{s}^{S} \frac{FPO_s / MB}{\min(CMR,\, AI_s)} \qquad (9)$$
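For concreteness, the following Python sketch (our own helper names, not part of the benchmarked code) evaluates Eqns. 7–9 for a layer described as a list of stages, each given by its FPO and DM; the per-stage values can be taken from Tbl. 2.

```python
def attainable_flops(peak_flops, mem_bw, ai):
    # Eqn. 7: performance ceiling for a given arithmetic intensity (AI = FPO/DM).
    return min(peak_flops, mem_bw * ai)

def stage_time(fpo, dm, peak_flops, mem_bw):
    # Eqn. 8: the stage is compute bound when CMR <= AI, memory bound otherwise.
    return fpo / attainable_flops(peak_flops, mem_bw, fpo / dm)

def total_time(stages, peak_flops, mem_bw):
    # Eqn. 9: accumulate the per-stage running-time estimates.
    return sum(stage_time(fpo, dm, peak_flops, mem_bw) for fpo, dm in stages)

# Hypothetical example: two stages on a 2 TFLOPS, 100 GB/s system (CMR = 20).
stages = [(4e9, 2e9), (4e10, 8e8)]   # (FPO, DM in bytes) per stage
print(total_time(stages, 2e12, 100e9))
```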

5.1 Relative Performances

We are interested in the relative performance between the Winograd and FFT methods. We define Speedup(A, B) as the ratio of the running times of an algorithm B and an algorithm A:

$$\text{Speedup}(A, B) = \frac{\text{running time}^{B}_{Total}}{\text{running time}^{A}_{Total}} \qquad (10)$$

A speedup greater than one indicates that algorithm A is faster, and a speedup smaller than one means that algorithm B is faster.

Eqn. 9 estimates the running time of an algorithm assuming perfect utilization of the hardware, which is rarely possible in practice. However, Eqn. 10 is still valid when the compared algorithms are equally optimized, meaning that they utilize the same percentage of the hardware capabilities.

In Eqn. 10, the value of AI will differ between the Winograd– and FFT–based approaches, and will also depend on the amount of available cache [18]. Therefore, when comparing the performance of two algorithms, the relative performance on a given system will depend on the CMR ratio and the amount of available cache, but not on the absolute values of the system's compute speed or memory throughput.

Detailed analysis on obtaining the values of AI, DM, and FPO is presented in Appendix A.

Using Eqn. 10 and the values of AI, DM, and FPO calculated in Appendix A (Tbl. 2), we estimate the speedup of the FFT–based methods over the Winograd–based one. For all three methods, the tile sizes are chosen in a way that minimizes the total running time (Eqn. 9).

5.2 The Accuracy of the Theoretical Estimates

In Fig. 3 we plot the estimated theoretical speedup of the two FFT–based methods over the Winograd–based one. The solid lines represent the theoretical estimates of the relative performances as a function of the system's CMR. The color of each line represents the amount of available L2 cache. The lines are drawn in a semi-translucent fashion, as they overlap in many places.

The empirical results from the benchmarks described in Section 4 are overlaid; each cross-hair represents the measurement of the relative performance on a single system. The x coordinate of a cross-hair represents the system's compute–to–memory ratio (CMR, see Tbl. 1), and the color represents the L2 cache size.

Our empirical results are consistent with our theoretical estimates. The overall relative root mean square error (rRMSE) was 0.079 for Regular–FFT vs. Winograd and 0.1 for Gauss–FFT vs. Winograd. The fitness⁴ was 92.68% for Regular–FFT vs. Winograd and 90% for Gauss–FFT vs. Winograd.

5.3 Discussions

Hardware utilization. We have also measured the system utilization of each stage of the three methods. While the utilization varied across benchmarked layers and systems, on average 75% of the theoretical peak FLOPS were attained during the compute-bound stages, and slightly more than 85% of the theoretical memory bandwidth was achieved in the memory-bound stages. This results in a somewhat lower effective CMR, which is consistent with the results shown in Fig. 3: the empirical results are slightly shifted to the left (towards lower values of CMR) when compared to the theoretical predictions, which assume equal utilization of both the FLOPS and the memory bandwidth.

Optimality of Tile Transforms. Both "wincnn" [7] and "genfft" [13] use heuristics to minimize the number of operations for performing the transforms, and might not be optimal. However, as their AIs are much smaller than the CMRs of modern systems, the computation is memory bound, and the running time of the transform stages depends solely on the amount of data movement. The largest AI of the FFT transforms is 5.55, and for Winograd 2.38, both much lower than the CMRs of modern CPUs. The Xeon Phi Knights Landing processor has a CMR of 11 (due to the on–chip fast MCDRAM),

⁴ fitness = 100 / (1 + rRMSE)

Figure 3: Theoretical estimates and empirical measurements for the speedup of the Regular– and Gauss–FFT methods over the Winograd method on VGG and AlexNet. The x axis represents the system's CMR (FLOPs/Byte); colors indicate the L2 cache size (256 KB, 512 KB, 1024 KB).

and the Xeon Server processor family has CMRs in the range of 20–40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were provided.

This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. Using such sizes, the images can be divided into overlapping tiles with minimal overhead (minimal padding of the original image), which reduces both the number of required operations and the amount of required data movement.

6 Conclusions

This paper presents experimental and analytical findings that FFT–based convolutions are, on average, faster than Winograd–based convolutions on modern CPUs.

In contrast to the popular belief that the Winograd–based method is more efficient, our experimental results with a highly optimized Winograd–based implementation and two similarly optimized FFT–based implementations on modern CPUs show that the FFT convolutions are, more often than not, faster than the Winograd ones for the most commonly used ConvNet layers.

Our analysis using the Roofline model shows that whether the Winograd– or FFT–based approach is faster depends on both the layer and the target hardware. However, with the tendency of increasing compute–to–memory ratios of systems with modern CPUs, the FFT–based methods tend to be faster than Winograd.

While we primarily focused on modern multi– and many–core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can be applied to GPUs. Future work might include implementing and benchmarking efficient GPU–based implementations and validating the performance analysis based on the Roofline model.

References

[1] 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library (2016).
[2] 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn (2016).
[3] 2018. N-Dimensional Winograd–based convolution framework. https://bitbucket.org/poozh/ond-winograd (2018).
[4] Accessed 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks.
[5] Accessed 08-15-2018. Intel Nervana reference deep learning framework. https://github.com/NervanaSystems/neon.
[6] Accessed 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm.
[7] Accessed 08-15-2018. Wincnn. https://github.com/andravin/wincnn.
[8] Accessed 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master.
[9] Jeremy Appleyard. Accessed 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5.
[10] David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR) 2009. IEEE, 248–255.
[13] Matteo Frigo and Steven G. Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 3. IEEE, 1381–1384.
[14] Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 229–254.
[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis (SC16). IEEE, 981–991.
[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[19] Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455–500.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
[23] M. Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306–308.
[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).
[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2924–2932.
[26] Victor Y. Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676–694.
[27] Lawrence R. Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Prentice-Hall, Englewood Cliffs, NJ (1975).
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24–26.
[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. SIAM.
[35] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20–24.
[36] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium (IPDPS) 2016. IEEE, 801–811.
[37] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis (SC16). IEEE, 854–865.
[38] Aleksandar Zlateski and H. Sebastian Seung. 2017. Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity, we assume 2D and isotropic images and kernels, as well as computation performed with 32–bit floating point numbers (4 bytes per number). Extension to non–isotropic and N–dimensional images and kernels, as well as lower-precision floats, is straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non–isotropic kernels.

We follow the same notation as in Section 2, and for

an arbitrary layer use Winograd convolution F(m², r²), Regular–FFT convolution 𝓕(m², r²), and Gauss–FFT convolution 𝒢(m², r²). Here m can take an arbitrary value, whereas r² has to equal the size of the kernels in the layer.

We proceed to present the details of all three methods and estimate the total number of operations and data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A.1 Input Transform Stage

In Winograd and both FFT approaches, each of the B · C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has a size of t², with t = m + r − 1, and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions.

Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image is implicitly zero–padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.
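As a small, self-contained illustration of the tile count (made-up layer sizes, not taken from the benchmarks):

```python
import math

def tiles_per_image(x, m, r):
    # N = ceil((x - r + 1) / m)^2 overlapping tiles of size t x t, with t = m + r - 1.
    return math.ceil((x - r + 1) / m) ** 2

# Example: 224x224 images, 3x3 kernels, m = 4 (i.e. 6x6 tiles), batch B = 64, C = 64 channels.
B, C, x, m, r = 64, 64, 224, 4, 3
N = tiles_per_image(x, m, r)
print(N, B * C * N)   # tiles per image, and the total BCN image tiles to transform
```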

Number of required operations. Each tile (I_bcn) is transformed via Ĩ_bcn = B I_bcn Bᵀ, which is a composition of two matrix–matrix multiplications. Here I_bcn is the n–th tile of the c–th channel in batch b, and Ĩ_bcn is its transform. Both B and I_bcn have size t², requiring a total of 4t³ operations (2t³ per matrix–matrix multiplication). The operations are on real numbers for the Winograd approach and on complex numbers for the FFT approaches.

The transforms can, however, be performed with much fewer operations. In the Winograd method, the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real–to–complex DFT transforms are performed, which require much fewer operations (µn² log n instead of O(n³)). Here the constant µ can vary significantly based on the size of the transform performed [13]: it takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.

Instead of using a closed–form expression that gives bounds on the number of required operations, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd, we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods, we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as F_I(m², r²) for Winograd, 𝓕_I(m², r²) for Regular–FFT, and 𝒢_I(m², r²) for Gauss–FFT. The pre–computed lookup tables are included in our open–sourced repository.

The total number of operations required for performing all the transforms for all three methods is given in Tbl. 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in–cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx² · 4 bytes, reading each of the B · C images of size x² only once. Each of the BCN transformed tiles is then stored back to main memory. The size of each transformed tile depends on the method.

In the Winograd convolution, transformed tiles are real tensors and require a single 32–bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms are complex conjugate–symmetric (along one of the dimensions); only t⌈(t + 1)/2⌉ complex numbers need to be stored. The Regular–FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and the Gauss–FFT requires three reals per complex number, for a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.
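The DM and AI rows of Tbl. 2 for this stage can be evaluated directly. Below is a sketch (our naming; the per-tile transform FLOP count is assumed to come from the lookup tables of Appendix B):

```python
import math

def input_stage_dm(B, C, N, x, t, method):
    # Bytes moved: read all B*C images of size x*x once (4 bytes/pixel),
    # then store B*C*N transformed tiles; the per-tile size depends on the method.
    bytes_per_tile = {
        "winograd":    4 * t * t,                        # real tensor, 1 float/element
        "regular_fft": 8 * t * math.ceil((t + 1) / 2),   # conjugate-symmetric, 2 floats/complex
        "gauss_fft":   12 * t * math.ceil((t + 1) / 2),  # 3 floats per complex number
    }[method]
    return 4 * B * C * x * x + B * C * N * bytes_per_tile

def input_stage_ai(flops_per_tile, B, C, N, x, t, method):
    # AI = FPO / DM for the whole input-transform stage.
    return B * C * N * flops_per_tile / input_stage_dm(B, C, N, x, t, method)
```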

A.2 Kernel Transform Stage

In all methods, each of the C × C′ kernels (of size r × r) is transformed via W̃_cc′ = G W_cc′ Gᵀ. The matrix G has size t × r.

Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero–padded to the size of the input tiles. In practice, the transforms are computed with implicit zero–padding. As in the input stage, we have pre–computed the number of operations required for transforming a kernel in all three methods: F_K(m², r²) for Winograd and 𝓕_K(m², r²) for Regular–FFT. For the Gauss–FFT method, 𝒢_K(m², r²) = 𝓕_K(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.


Table 2: FLOPS, DM, and AI for different stages of the Winograd (F(m², r²)), Regular–FFT (𝓕(m², r²)), and Gauss–FFT (𝒢(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

FLOPS
  Input image transform:  Winograd BCN·F_I(m², r²)  |  Regular–FFT BCN·𝓕_I(m², r²)  |  Gauss–FFT BCN·𝒢_I(m², r²)
  Kernel transform:       Winograd CC′·F_K(m², r²)  |  Regular–FFT CC′·𝓕_K(m², r²)  |  Gauss–FFT CC′·𝒢_K(m², r²)
  Element–wise product:   Winograd 2t²BNCC′  |  Regular–FFT 8t⌈(t+1)/2⌉BNCC′  |  Gauss–FFT 6t⌈(t+1)/2⌉BNCC′
  Output transform:       Winograd BC′N·F_O(m², r²)  |  Regular–FFT BC′N·𝓕_O(m², r²)  |  Gauss–FFT BC′N·𝒢_O(m², r²)

DM (bytes)
  Input image transform:  Winograd 4BCx² + 4BCNt²  |  Regular–FFT 4BCx² + 8BCNt⌈(t+1)/2⌉  |  Gauss–FFT 4BCx² + 12BCNt⌈(t+1)/2⌉
  Kernel transform:       Winograd 4CC′(r² + t²)  |  Regular–FFT 4CC′(r² + 2t⌈(t+1)/2⌉)  |  Gauss–FFT 4CC′(r² + 3t⌈(t+1)/2⌉)
  Element–wise product:   Winograd 4t²BN(c + αc′)CC′/(cc′)  |  Regular–FFT 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)  |  Gauss–FFT 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
  Output transform:       Winograd 4BC′N(t² + m²)  |  Regular–FFT 4BC′N(2t⌈(t+1)/2⌉ + m²)  |  Gauss–FFT 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
  Input image transform:  Winograd N·F_I(m², r²)/(4x² + 4Nt²)  |  Regular–FFT N·𝓕_I(m², r²)/(4x² + 8Nt⌈(t+1)/2⌉)  |  Gauss–FFT N·𝒢_I(m², r²)/(4x² + 12Nt⌈(t+1)/2⌉)
  Kernel transform:       Winograd F_K(m², r²)/(4r² + 4t²)  |  Regular–FFT 𝓕_K(m², r²)/(4r² + 8t⌈(t+1)/2⌉)  |  Gauss–FFT 𝒢_K(m², r²)/(4r² + 12t⌈(t+1)/2⌉)
  Element–wise product:   Winograd cc′/(2(c + αc′))  |  Regular–FFT cc′/(c + αc′)  |  Gauss–FFT cc′/(2(c + αc′))
  Output transform:       Winograd F_O(m², r²)/(4t² + 4m²)  |  Regular–FFT 𝓕_O(m², r²)/(8t⌈(t+1)/2⌉ + 4m²)  |  Gauss–FFT 𝒢_O(m², r²)/(12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory, reading a total of 4CC′r² bytes, and store the CC′ transformed kernels back to main memory. The transformed kernels have size t × t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉, and 12t⌈(t + 1)/2⌉ bytes per kernel for the Winograd, Regular–FFT, and Gauss–FFT methods, respectively.

The total number of operations, data movements, and AIs for transforming all kernels of a particular layer using any of the three methods are given in Tbl. 2.

A.3 Element–wise Products

Having all transformed input and kernel tiles, the pre–transformed output tiles Ĩ′_bc′n are computed via

$$\tilde{I}'_{bc'n} = \sum_{c=1}^{C} \tilde{I}_{bcn} \odot \tilde{W}_{cc'} \qquad (11)$$

Here, all of the pre–transformed output tiles (Ĩ′_bc′n), transformed input tiles Ĩ_bcn, and transformed kernels W̃_cc′ have the same size, t × t. Computing an element of a pre–transformed output tile at a location ℓ depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

$$\tilde{I}'_{bc'n}(\ell) = \sum_{c=1}^{C} \tilde{I}_{bcn}(\ell) \cdot \tilde{W}_{cc'}(\ell) \qquad (12)$$

Note that the equation above can be interpreted as the multiplication of a matrix U_ℓ of size BN × C with a matrix V_ℓ of size C × C′, resulting in a matrix X_ℓ of size BN × C′. For layers of modern ConvNets, the values of BN are much larger than C, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location of Ĩ′_bc′n.
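Interpreted this way, the whole element–wise stage is a batch of tall-and-skinny matrix multiplications, one per tile location; a minimal NumPy sketch (array names and sizes are ours, for the real-valued Winograd case):

```python
import numpy as np

BN, C, Cp, L = 3136, 64, 64, 36       # L = t*t tile locations (6x6 tiles)
U = np.random.rand(L, BN, C)          # transformed input tiles, gathered per location
V = np.random.rand(L, C, Cp)          # transformed kernels, per location
X = np.einsum('lbc,lco->lbo', U, V)   # Eqn. 12: one (BN x C)(C x C') matmul per location
print(X.shape)                        # (L, BN, Cp) pre-transformed output tiles
```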

A.3.1 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT–based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre–transformed output tiles need to be computed: as both the transformed input tiles and the transformed kernels are conjugate symmetric, the pre–transformed output tiles will also be conjugate symmetric.

In the Regular–FFT method, 4 real multiply–adds are required for performing a complex multiply–add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs for the whole layer. Gauss–FFT reduces each complex matrix multiplication to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs for the whole layer.

A.3.2 Data Movement

In the Winograd method, t² real matrix multiplications X = U × V are performed (for the rest of the section we omit the subscript ℓ for clarity). In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications of the same sizes are performed. The Gauss–FFT method replaces each complex matrix multiplication with three real matrix multiplications; thus, a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U, and V may not fit in cache, in which case standard cache–blocking approaches


are employed [14–16, 18, 22], which might require some data to be moved to and/or from the main memory multiple times.

In the standard cache–blocking technique, V (of size C × C′) is subdivided into equally sized sub–matrices of size c × c′. To minimize transfers to and from main memory, a sub–matrix of V is kept in cache. A small sub–matrix of U, with size ρ × c, is fetched from memory, multiplied with the sub–matrix of V stored in–cache, and accumulated into the appropriate sub–matrix of X. Here ρ is a small number required for efficient in–cache computation [14–16, 18, 22].

This requires transferring a total of ρc numbers from main memory, plus ρc′ numbers of X from, and then back to, the main memory; a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub–matrix multiplication produces a final result (no accumulation is necessary).

A total of BNCC′/(ρcc′) such sub–matrix multiplications are performed, requiring BN(CC′/cc′)(c + αc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications

(X = U × V) are performed, thus transferring a total of 4t²BN(c + αc′)CC′/(cc′) bytes.

In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

The Gauss–FFT method replaces each of the t⌈(t + 1)/2⌉ complex matrix multiplications with 3 real matrix multiplications; a total of 12t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes need to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and c′ while allowing for in–cache computation.

As the values of t, B, C, C′, and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

$$\begin{aligned} \underset{c,\,c'}{\text{minimize}} \quad & \frac{c + \alpha c'}{c\,c'} \\ \text{subject to} \quad & c \mid C \quad (C \text{ is divisible by } c) \\ & c' \mid C' \quad (C' \text{ is divisible by } c') \\ & 4\beta c c' \le \frac{\text{Cache Size}}{2} \quad (\text{fits in half the cache}) \end{aligned} \qquad (13)$$

Here α equals 1 when c = C and 2 when c < C; β is 1 for real–valued matrices and 2 for complex–valued ones. Half the cache is allotted for the sub–matrices of V; this is typical practice, ensuring enough space for the sub–matrices of U and X.

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for the optimal values of c and c′. This results in AIs of cc′/(2(c + αc′)) for the Winograd and Gauss–FFT methods, and cc′/(c + αc′) for the Regular–FFT method, which are equal to the AIs of real–matrix and complex–matrix multiplications, respectively [14–16, 18].
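Eqn. 13 can be solved by brute force over the divisors of C and C′. The sketch below (our code, assuming 4-byte reals as in the analysis) also returns the resulting element–wise AI from Tbl. 2; for the same cache size, the complex case (Regular–FFT) reaches a higher AI than the real case.

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def best_blocking(C, Cp, cache_bytes, complex_mats=False):
    # Eqn. 13: minimize (c + alpha*c') / (c*c') s.t. a c x c' block of V fits in half the cache.
    beta = 2 if complex_mats else 1
    best = None
    for c in divisors(C):
        for cp in divisors(Cp):
            if 4 * beta * c * cp > cache_bytes // 2:
                continue
            alpha = 1 if c == C else 2
            cost = (c + alpha * cp) / (c * cp)
            if best is None or cost < best[0]:
                best = (cost, c, cp, alpha)
    cost, c, cp, alpha = best
    # AI of the element-wise stage: cc'/(c + alpha*c') for complex, cc'/(2(c + alpha*c')) for real.
    ai = c * cp / ((1 if complex_mats else 2) * (c + alpha * cp))
    return c, cp, ai

# Example: C = C' = 512 channels and a 1 MB L2 cache.
print(best_blocking(512, 512, 1 << 20, complex_mats=False))  # real matrices (Winograd, Gauss-FFT)
print(best_blocking(512, 512, 1 << 20, complex_mats=True))   # complex matrices (Regular-FFT)
```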

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels as a function of the cache size.

Figure 4: Arithmetic intensities for the element–wise stage given various amounts of cache (left panel: real matrices; right panel: complex matrices; curves for C and C′ of 64, 128, 256, 512, and 1024). In the Winograd and Gauss–FFT methods real matrix multiplications are performed, whereas in the Regular–FFT method complex matrix multiplications are performed.

The AIs of both complex (used in the Regular–FFT method) and real matrix multiplication (used in Winograd and Gauss–FFT) increase with the cache size and the number of channels (C and C′). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element–wise stage of the Regular–FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss–FFT convolutions when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the BNC′ pre–transformed output tiles Ĩ′_bc′n is transformed from the Winograd/FFT domain back to the spatial domain via I′_bc′n = Aᵀ Ĩ′_bc′n A. The matrix A has size t × m (t = m + r − 1), resulting in I′_bc′n having a size of m × m. Here I′_bc′n is the n–th (non–overlapping) tile of the c′–th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m × m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements, and AIs for transforming the output tiles of a particular layer using any of the three methods are given in Tbl. 2.

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular–FFT method.

In Tbl. 7 and Tbl. 8 we show the lookup tables used in our model to determine the number of required operations

Table 3: FLOPS required for performing Winograd transforms of a single tile or kernel (for each r = 2 … 7, the three columns are In / Ker / Out).

F(2², r²):  12 5 5  |  32 49 42  |  180 198 154  |  240 231 168  |  742 728 504  |  704 645 430
F(3², r²):  32 24 28  |  180 128 128  |  240 170 153  |  742 552 460  |  704 518 407  |  –
F(4², r²):  180 70 90  |  240 135 150  |  742 396 396  |  704 403 372  |  –  |  –
F(5², r²):  240 72 99  |  742 260 312  |  704 300 325  |  –  |  –  |  –
F(6², r²):  742 144 208  |  704 242 308  |  –  |  –  |  –  |  –
F(7², r²):  704 130 195  |  –  |  –  |  –  |  –  |  –

Table 4: AIs of Winograd transforms for a single tile or kernel (for each r = 2 … 7, the three columns are In / Ker / Out).

F(2², r²):  0.17 0.10 0.10  |  0.25 0.49 0.53  |  0.90 1.21 1.33  |  0.83 0.95 1.05  |  1.89 2.14 2.38  |  1.38 1.43 1.58
F(3², r²):  0.25 0.30 0.28  |  0.90 0.94 0.94  |  0.83 0.82 0.85  |  1.89 1.86 1.98  |  1.38 1.29 1.39  |  –
F(4², r²):  0.90 0.60 0.55  |  0.83 0.75 0.72  |  1.89 1.52 1.52  |  1.38 1.13 1.16  |  –  |  –
F(5², r²):  0.83 0.45 0.41  |  1.89 1.12 1.05  |  1.38 0.94 0.91  |  –  |  –  |  –
F(6², r²):  1.89 0.68 0.61  |  1.38 0.83 0.77  |  –  |  –  |  –  |  –
F(7², r²):  1.38 0.48 0.43  |  –  |  –  |  –  |  –  |  –

Table 5: FLOPS required for performing Regular–FFT transforms of a single tile or kernel.

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -


Table 6: AIs of Regular–FFT transforms for a single tile or kernel.

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss–FFT method.

C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss–FFT vs. the Regular–FFT method.

D Comparing to State–of–the–art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and other state–of–the–art libraries. The benchmarks were performed on AVX512-capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus, the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.

Table 7: FLOPS required for performing Gauss–FFT transforms of a single tile or kernel.

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

Figure 5: Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs. the Gauss–FFT method, on VGG and AlexNet (one panel per layer; x axis: system CMR).

Table 8: AIs of Gauss–FFT transforms for a single tile or kernel.

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

Figure 6: Running time of our implementations compared to state–of–the–art on AVX512-capable systems (one panel per system, e.g. Xeon Phi 7210 with CMR 11 and i9-7900X with CMR 22 and 31; time in milliseconds per AlexNet and VGG layer; legend: Our Winograd, Our Regular–FFT, Our Gauss–FFT, MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd).

Figure 7: Running time of our implementations compared to state–of–the–art on AVX512-capable systems, continued (Xeon Phi 7210 using 48 cores with CMR 33, Xeon Phi 7210 with CMR 39.11, and i9-7900X with CMR 41.25; legend as in Figure 6).


Table 1. Machine configurations used for benchmarking. MB represents the theoretical peak memory bandwidth, and CMR represents the ratio between the available FLOPS and the memory bandwidth.

CPU                  | Cores | GFLOPS | AVX | Cache  | MB (GB/s) | CMR
Xeon Phi 7210        | 64    | 4506   | 512 | 512 KB | 409.6     | 11
i7-6950X             | 10    | 960    | 2   | 1 MB   | 68.3      | 14.06
i9-7900X             | 10    | 2122   | 512 | 1 MB   | 96        | 22
Xeon Gold 6148       | 20    | 3072   | 512 | 1 MB   | 128       | 24
E7-8890v3            | 18    | 1440   | 2   | 256 KB | 51.2      | 28.13
Xeon Platinum 8124M  | 18    | 3456   | 512 | 1 MB   | 115.2     | 30
i9-7900X             | 10    | 2122   | 512 | 1 MB   | 68.3      | 31
Xeon Phi 7210        | 48    | 4506   | 512 | 512 KB | 102.4     | 33
Xeon Phi 7210        | 64    | 4506   | 512 | 512 KB | 102.4     | 39.11
i9-7900X             | 10    | 2122   | 512 | 1 MB   | 51.2      | 41.25

that are powers of two [9, 24], our implementations support arbitrary sizes.

Element-Wise Multiplications. For the element-wise stage, where matrix-matrix multiplications are performed, the implementation of [18] provides JIT routines for real-matrix multiplications optimized for the AVX512 instruction set. Following the same principles as [18], we implemented JIT real-matrix multiplication routines optimized for the AVX2 instruction set, as well as complex-number multiplication routines for both the AVX512 and AVX2 instruction sets, which are required for the Regular-FFT method.

Parallelization Through Static Scheduling. Each stage of our algorithm is parallelized using static scheduling, originally proposed in [38], using the generalized implementation provided by [18]. To achieve optimal performance, each core is assigned roughly the same amount of computation. The work is then executed using a single fork-join routine.

4 Performance Comparisons

To compare the running times of the FFT and Winograd methods, we benchmarked our FFT-based implementations and the improved Winograd implementation of Jia et al. [18] using the two most popular ConvNets, AlexNet [20] and VGG [28], which are frequently used for benchmarking [4]. We benchmarked the time required for the forward propagation of each distinct layer of the two networks.

Additional publicly available implementations were included for reference: the Winograd implementations provided by LIBXSMM [6] and MKL-DNN [2], and the direct convolution provided by MKL-DNN [2]. To the best of our knowledge, no other implementation of FFT-based methods for CPUs, besides our own, is currently available.

Both the Winograd and FFT methods can work with arbitrary transform sizes. Generally, larger transform sizes decrease the number of required operations. However, the numerical inaccuracy of Winograd convolutions increases exponentially with the transform (tile) size [18, 21, 26]. In practice, there is an upper bound on the transform size for which Winograd produces satisfactory results. All major vendors, such as FALCON, MKL-DNN, LIBXSMM, and cuDNN [1, 2, 6, 9], implement the Winograd algorithm with transform sizes of at most 6 × 6.² For these tile sizes the numerical error of the computation is comparable to that of direct convolution.

We follow the same convention and consider Winograd convolutions only with tile sizes of at most 6 × 6. However, we allow the FFT methods to have arbitrarily large tile sizes, as they do not suffer from such numerical instability.

Running Time Experiments. The benchmarks were performed on a total of 10 systems, shown in Tbl. 1. In Fig. 1 we show detailed results on one of the systems.

Note that both LIBXSMM's and MKL-DNN's Winograd implementations support only kernel sizes of 3 × 3, and thus cannot use the Winograd method for the second layer of AlexNet.

The Winograd-based method outperformed in only 3 out of 12 layers, whereas an FFT-based method outperformed on 6 layers; on 3 layers they had roughly the same performance. More importantly, when the FFT methods outperformed, they did so by a larger margin, and on the layers that require more computation time. This suggests that the FFT methods can allow for significant savings in the overall computation time when compared to Winograd. For instance, the time spent on all convolutional layers in AlexNet using the Winograd method would consume 58.79 milliseconds, whereas the Regular-FFT method requires only 31.96 milliseconds, a speedup of 1.84x.

Additionally, in Fig. 2 we show the normalized running time on all other AVX512 systems (neither LIBXSMM nor MKL-DNN supports the AVX2 instruction set). The results for each individual layer are scaled based on the slowest implementation, as we are only interested in the relative performances. Detailed results are available in the Appendix. In all cases, our two FFT-based implementations, as well as the modified Winograd implementation of [18], outperformed the other publicly available implementations. Thus, for the rest of the paper we focus only on these three implementations.

FFT transform sizes. An important observation was that the optimal transform sizes for the FFT method were not always powers of two: they were 27 for VGG 1.2, 25 for VGG 2.1 and VGG 2.2, 21 for VGG 3.1 and VGG 3.2, 16 for VGG 4.1 and VGG 4.2, and 9 for VGG 5.1/5.2. For AlexNet, the optimal sizes were 31 for the second layer and 15 for all other layers. This is counter-intuitive, as the common belief is that optimal FFT transforms should use sizes that only contain small prime factors [13, 37], or sizes equal to powers of two, as suggested by the two GPU FFT-based implementations, fbfft [24, 31] and cuDNN [11].
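As a concrete illustration of why a non-power-of-two tile size can win, the short sketch below (our own illustrative Python, not code from the benchmarked implementations; the 224 × 224 input and 3 × 3 kernel are merely example values) counts the overlapping tiles N = ⌈(x − r + 1)/m⌉² and the implicit zero-padding overhead for a few candidate transform sizes.

    # Illustrative sketch: tile counts and padded-input overhead for candidate tile sizes.
    import math

    def tiling_overhead(x, r, t):
        m = t - r + 1                       # output elements produced per tile
        n_1d = math.ceil((x - r + 1) / m)   # tiles along one dimension
        covered = n_1d * m + r - 1          # input extent processed, incl. implicit padding
        return n_1d ** 2, (covered / x) ** 2 - 1

    for t in (16, 27, 32):                  # example transform sizes
        n_tiles, overhead = tiling_overhead(224, 3, t)
        print(f"t={t:2d}: {n_tiles:4d} tiles, {overhead:6.1%} padded-input overhead")

Under these example sizes, a tile size of 27 covers the image with less implicit padding than 32, which is consistent with the measured optimum for the first VGG layers.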

²With transform sizes of 6 × 6, the average numerical error of the Winograd method on the benchmarked layers was 7.03 · 10⁻⁶, which is similar to the error of direct convolution, 1.11 · 10⁻⁶. When the transform size is increased to 8 × 8, the error increases to 1.24 · 10⁻³, which was expected [18]. The numerical error of the FFT-based method was no larger than 2.88 · 10⁻⁷, regardless of the tile size.


Figure 1. Convolution layers' running time (in milliseconds) on the Xeon Gold CPU for AlexNet layers 2-5 and VGG layers 1.2-5.1/5.2. Implementations compared: Regular-FFT, Gauss-FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct, and LIBXSMM Winograd.

Figure 2. Normalized running time of different convolution implementations (Regular-FFT, Gauss-FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd) on the remaining AVX512 systems and layers (AlexNet 2-5, VGG 1.2-5.1/5.2). The systems with the same CPUs are distinguished by their CMR.

Relative performances. While an FFT method outperformed Winograd more often than not, the relative performances varied among different layers and systems. In the rest of the paper we focus on the theoretical analysis of all the methods. We would like to understand why our findings are not aligned with the popular belief that the Winograd method should be strictly faster.

5 Performance Analysis

Our experimental observations suggested that some of the stages in both the Winograd- and FFT-based approaches have relatively low utilization of the system's available FLOPS. In most, but not all, cases these were the transform stages, which perform a relatively small amount of compute but access a relatively large amount of data, suggesting that their running time is bound by the memory bandwidth and not by the computational capabilities of the CPU.

For this reason, we used the Roofline [33] performance model to analyze the Winograd- and FFT-based convolutions.

Roofline Performance Model is an ISA-oblivious, throughput-oriented performance model used to estimate the performance of an application running on processing units; it is often used to predict performances on CPUs, GPUs, TPUs (Tensor Processing Units), etc.

It is suitable for analyzing particular methods on systems where the memory bandwidth often becomes the constraining resource. It estimates the performance bound of an algorithm as a function of its arithmetic intensity (AI), which is defined as the ratio of the total floating-point operations (FPO) and the total data movement (DM) in bytes (AI = FPO/DM) between different levels of the memory hierarchy. For each level of the memory hierarchy, the performance ceiling (attainable FLOPS) is determined by

    Attainable FLOPS = min(Peak FLOPS, MB × AI)          (7)

where Peak FLOPS is defined as the maximal number of floating-point operations per second of a given system, and MB as the system's peak memory bandwidth. When plotted, the performance ceiling line resembles a roofline.

Here we are interested in the DM between the highest level of on-chip, core-exclusive cache (typically L2 for CPUs) and off-chip shared memory.³ In systems where the L2 is shared among a small number of cores (e.g. the Intel Xeon Phi series shares an L2 cache between two cores), the L2 is assumed to be equally divided for exclusive usage among the cores.

DM then accounts for all regular and streaming stores to main memory, regular loads from main memory, and pre-fetches from main memory to either L1 or L2.

Our analysis does not consider the presence of higher-level shared caches, as on modern processors the sizes of such caches are very limited and cannot fit the working set of a typical convolutional layer.

Additional performance ceilings for lower levels of cache are not necessary, since in both the Winograd-based and FFT-based convolutional algorithms the computation is either bounded by transfers to and from main memory or by the processor's peak FLOPS, as shown in the following sections.

³The term off-chip refers to the main memory of the system, typically large but with much lower throughput and higher latency than the on-chip caches. The HBW MCDRAM of the Knights Landing processor would be considered off-chip.

As the performance ceiling is set by the algorithm's arithmetic intensity (Eqn. 7), its running time can be estimated by

    running time = FPO / Attainable FLOPS
                 = FPO / min(Peak FLOPS, MB × AI)
                 = (FPO / MB) / min(CMR, AI)
                 = FPO / Peak FLOPS   if CMR ≤ AI
                   DM / MB            if CMR > AI          (8)

Here CMR is the system's compute-to-memory ratio, defined as the ratio of its Peak FLOPS and MB (the memory bandwidth). FPO represents the total number of floating-point operations required, and DM the total amount of data movement in bytes, as defined above. The running time is compute bound when CMR ≤ AI, in which case it can be computed as FPO / Peak FLOPS; otherwise, the running time is memory bound and can be computed as DM / MB.

Estimating running time. For the Winograd- and FFT-based convolutional layers, where the computation is composed of several sequential stages (S), each with a unique arithmetic intensity (AI), the running time is calculated by accumulating the running times of each stage s ∈ S:

    running time_Total = Σ_{s∈S} running time_s = Σ_{s∈S} (FPO_s / MB) / min(CMR, AI_s)          (9)
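The following minimal Python sketch (ours, with made-up placeholder numbers) shows how Eqns. 7-9 turn per-stage FLOP counts and data movements into a running-time estimate.

    # Roofline-based per-stage running time estimate (Eqns. 7-9); all inputs are
    # hypothetical placeholders.
    def stage_time(fpo, dm, peak_flops, mem_bw):
        ai = fpo / dm                                  # arithmetic intensity (FLOPs/byte)
        attainable = min(peak_flops, mem_bw * ai)      # Eqn. 7: performance ceiling
        return fpo / attainable                        # Eqn. 8: compute- or memory-bound

    def total_time(stages, peak_flops, mem_bw):
        # Eqn. 9: a layer is a sequence of stages, each with its own AI.
        return sum(stage_time(fpo, dm, peak_flops, mem_bw) for fpo, dm in stages)

    # Hypothetical example: a memory-bound transform stage followed by a
    # compute-bound element-wise stage, on a 3 TFLOPS / 100 GB/s system.
    stages = [(2e9, 4e9), (400e9, 8e9)]                # (FPO, DM) pairs per stage
    print(total_time(stages, peak_flops=3e12, mem_bw=100e9))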

5.1 Relative Performances

We are interested in the relative performance between the Winograd and FFT methods. We define Speedup(A, B) as the ratio of the running times of an algorithm B and an algorithm A:

    Speedup(A, B) = running time_B,Total / running time_A,Total          (10)

A speedup greater than one indicates that algorithm A is faster; a speedup smaller than one means that algorithm B is faster.

Eqn. 9 estimates the running time of an algorithm assuming perfect utilization of the hardware, which is rarely possible in practice. However, Eqn. 10 remains valid when the compared algorithms are equally optimized, meaning that they utilize the same percentage of the hardware capabilities.

In Eqn. 10, the value of AI will differ between the Winograd- and FFT-based approaches, and will also depend on the amount of available cache [18]. Therefore, when comparing the performance of two algorithms, the relative performance on a given system will depend on the CMR ratio and the amount of available cache, but not on the absolute values of the system's compute speed or memory throughput.

A detailed analysis for obtaining the values of AI, DM, and FPO is presented in Appendix A.

Using Eqn. 10 and the values of AI, DM, and FPO calculated in Appendix A (Tbl. 2), we estimate the speedup of the FFT-based methods over the Winograd-based one. For all three methods, the tile sizes are chosen in a way that minimizes the total running time (Eqn. 9).
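Continuing the sketch above, Eqn. 10 then becomes a simple ratio of the two modeled totals; the helper below assumes the total_time function from the previous sketch.

    # Eqn. 10 as a ratio of modeled totals; values greater than one mean A is faster.
    def speedup(stages_a, stages_b, peak_flops, mem_bw):
        return (total_time(stages_b, peak_flops, mem_bw) /
                total_time(stages_a, peak_flops, mem_bw))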

5.2 The Accuracy of the Theoretical Estimates

In Fig. 3 we plot the estimated theoretical speedup of the two FFT-based methods over the Winograd-based one. The solid lines represent the theoretical estimates of the relative performances as a function of the system's CMR. The color of each line represents the amount of available L2 cache. The lines are drawn in a semi-translucent fashion, as they overlap in many places.

The empirical results from the benchmarks described in Section 4 are overlaid: each cross-hair represents the measurement of the relative performances on a single system. The x coordinate of a cross-hair represents the system's compute-to-memory ratio (CMR, see Tbl. 1), and the color represents the L2 cache size.

Our empirical results are consistent with our theoretical estimates. The overall relative root mean square error (rRMSE) was 0.079 for Regular-FFT vs. Winograd and 0.1 for Gauss-FFT vs. Winograd. The fitness⁴ was 92.68% for Regular-FFT vs. Winograd and 90% for Gauss-FFT vs. Winograd.

5.3 Discussions

Hardware utilization. We also measured the system utilization for each stage of the three methods. While the utilization varied across benchmarked layers and systems, on average 75% of the theoretical peak FLOPS was attained during the compute-bound stages, and slightly more than 85% of the theoretical memory bandwidth was achieved during the memory-bound stages. This results in a somewhat lower effective CMR, which is consistent with the results shown in Fig. 3: the empirical results are slightly shifted to the left (towards lower values of CMR) when compared to the theoretical predictions, which assume equal utilization of both the FLOPS and the memory bandwidth.

Optimality of Tile Transforms. Both "wincnn" [7] and "genfft" [13] use heuristics to minimize the number of operations for performing transforms, and might not be optimal. However, as their AIs are much smaller than the CMRs of modern systems, the computation is memory bound, and the running time of the transform stages will depend solely on the amount of data movement. The largest AI of the FFT transforms is 5.55, and for Winograd 2.38, much lower than the CMRs of modern CPUs.

⁴fitness = 100 / (1 + rRMSE)

Figure 3. Theoretical estimates and empirical measurements for the speedup of the Regular- and Gauss-FFT methods over the Winograd method on VGG and AlexNet. The x axis represents the system's CMR (FLOPs/Byte); colors distinguish L2 cache sizes of 256 KB, 512 KB, and 1024 KB.

The Xeon Phi Knights Landing processor has a CMR of 11 (due to the on-chip fast MCDRAM), and the Xeon Server processor family has CMRs in the range of 20-40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were provided.

This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. Using such sizes, the images could be divided into overlapping tiles with minimal overhead (minimal padding of the original image), which reduces both the number of required operations and the amount of required data movement.

6 Conclusions

This paper presents experimental and analytical findings that FFT-based convolutions are, on average, faster than Winograd-based convolutions on modern CPUs.

In contrast to the popular belief that the Winograd-based method is more efficient, our experimental results of a highly optimized Winograd-based implementation and two similarly optimized FFT-based implementations on modern CPUs show that the FFT convolutions are, more often than not, faster than Winograd ones for most commonly used ConvNet layers.

Our analysis using the Roofline model shows that whether the Winograd- or FFT-based approach is faster depends on both the layer and the target hardware. However, with the tendency of increasing compute-to-memory ratios of systems with modern CPUs, the FFT-based methods tend to be faster than Winograd.

While we primarily focused on modern multi- and many-core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can be applied to GPUs. Future work might include implementing and benchmarking efficient GPU-based implementations and validating the performance analysis based on the Roofline model.

References
[1] 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library (2016).
[2] 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn (2016).
[3] 2018. N-Dimensional Winograd-based convolution framework. https://bitbucket.org/poozh/ond-winograd (2018).
[4] Accessed 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks.
[5] Accessed 08-15-2018. Intel Nervana reference deep learning framework. https://github.com/NervanaSystems/neon.
[6] Accessed 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm.
[7] Accessed 08-15-2018. Wincnn. https://github.com/andravin/wincnn.
[8] Accessed 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master.
[9] Jeremy Appleyard. Accessed 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5.
[10] David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248-255.
[13] Matteo Frigo and Steven G. Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, Vol. 3. IEEE, 1381-1384.
[14] Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 229-254. http://dl.acm.org/citation.cfm?id=647970.761024
[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 981-991.
[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[19] Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455-500.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013-4021.
[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
[23] M. Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306-308.
[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).
[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2924-2932.
[26] Victor Y. Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676-694.
[27] Lawrence R. Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Prentice-Hall, Englewood Cliffs, NJ.
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1-9.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24-26.
[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. SIAM.
[35] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20-24.
[36] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNN - A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium, 2016 IEEE International. IEEE, 801-811.
[37] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 854-865.
[38] Aleksandar Zlateski and H. Sebastian Seung. 2017. Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity, we assume 2D and isotropic images and kernels, as well as computation performed with 32-bit floating point numbers (4 bytes per number). The extension to non-isotropic and N-dimensional images and kernels, as well as lower-precision floats, is straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non-isotropic kernels.

We follow the same notation as in Section 2, and for an arbitrary layer use Winograd convolution F(m², r²), Regular-FFT convolution F(m², r²), and Gauss-FFT convolution G(m², r²). Here m can take an arbitrary value, whereas r² has to equal the size of the kernels in the layer.

We proceed to present the details of all three methods and estimate the total number of operations and the data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A.1 Input Transform Stage

In Winograd and both FFT approaches, each of the B · C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has size t², with t = (m + r − 1), and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions. Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image is implicitly zero-padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.

Number of required operations. Each tile (I_{b,c,n}) is transformed via Ĩ_{b,c,n} = B I_{b,c,n} Bᵀ, which is a composition of two matrix-matrix multiplications. Here I_{b,c,n} is the n-th tile of the c-th channel in batch b, and Ĩ_{b,c,n} is its transform. Both B and I_{b,c,n} have size t², requiring a total of 4t³ operations (2t³ per matrix-matrix multiplication). The operations are on real numbers for the Winograd approach and on complex numbers for the FFT approaches.

The transforms can, however, be performed with many fewer operations. In the Winograd method, the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real-to-complex DFT transforms are performed, which require many fewer operations (μn² log n instead of O(n³)). Here the constant μ can vary significantly based on the size of the transform performed [13]. It takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.

Instead of using a closed-form expression that bounds the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as F_I(m², r²) for Winograd, F_I(m², r²) for Regular-FFT, and G_I(m², r²) for Gauss-FFT. The pre-computed lookup tables are included in our open-sourced repository.

The total number of operations required for performing all the transforms for all three methods is given in Tbl. 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in-cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx² · 4 bytes: each of the B · C images of size x² is read only once. Each of the BCN transformed tiles is then stored back to main memory. The size of each transformed tile depends on the method.

In Winograd convolution, transformed tiles are real tensors and require a single 32-bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms are conjugate-symmetric (along one of the dimensions), so only t⌈(t + 1)/2⌉ complex numbers need to be stored. Regular-FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and Gauss-FFT requires three reals per complex number, a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.
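The bookkeeping above can be summarized in a small sketch (illustrative only, not the benchmarked code; the per-tile operation count F_I, F_I, or G_I comes from the pre-computed lookup tables and is therefore passed in as a parameter).

    # FLOPs, data movement, and AI of the input-transform stage (see Tbl. 2).
    from math import ceil

    def input_transform_stats(B, C, x, m, r, flops_per_tile, method):
        t = m + r - 1
        N = ceil((x - r + 1) / m) ** 2                   # tiles per image
        fpo = B * C * N * flops_per_tile                 # total FLOPs for the stage
        if method == "winograd":
            tile_bytes = 4 * t * t                       # real tensor, 4 bytes per element
        elif method == "regular_fft":
            tile_bytes = 8 * t * ceil((t + 1) / 2)       # conjugate-symmetric complex
        else:                                            # "gauss_fft"
            tile_bytes = 12 * t * ceil((t + 1) / 2)      # three reals per complex number
        dm = 4 * B * C * x * x + B * C * N * tile_bytes  # read images once + store tiles
        return fpo, dm, fpo / dm                         # FLOPs, bytes moved, AI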

A.2 Kernel Transform Stage

In all methods, each of the C × C′ kernels (of size r × r) is transformed via W̃_{c,c′} = G W_{c,c′} Gᵀ. The matrix G has size t × r.

Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero-padded to match the size of the input tiles. In practice, the transforms are computed with implicit zero-padding. As in the input stage, we have pre-computed the number of operations required for transforming a kernel in all three methods: F_K(m², r²) for Winograd and F_K(m², r²) for Regular-FFT. For the Gauss-FFT method, G_K(m², r²) = F_K(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.


Table 2. FLOPS, DM, and AI for the different stages of the Winograd (F(m², r²)), Regular-FFT (F(m², r²)), and Gauss-FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

FLOPS
  Input image transform:  Winograd BCN·F_I(m², r²) | Regular-FFT BCN·F_I(m², r²) | Gauss-FFT BCN·G_I(m², r²)
  Kernel transform:       Winograd CC′·F_K(m², r²) | Regular-FFT CC′·F_K(m², r²) | Gauss-FFT CC′·G_K(m², r²)
  Element-wise product:   Winograd 2t²BNCC′ | Regular-FFT 8t⌈(t+1)/2⌉BNCC′ | Gauss-FFT 6t⌈(t+1)/2⌉BNCC′
  Output transform:       Winograd BC′N·F_O(m², r²) | Regular-FFT BC′N·F_O(m², r²) | Gauss-FFT BC′N·G_O(m², r²)

DM
  Input image transform:  Winograd 4BCx² + 4BCNt² | Regular-FFT 4BCx² + 8BCNt⌈(t+1)/2⌉ | Gauss-FFT 4BCx² + 12BCNt⌈(t+1)/2⌉
  Kernel transform:       Winograd 4CC′(r² + t²) | Regular-FFT 4CC′(r² + 2t⌈(t+1)/2⌉) | Gauss-FFT 4CC′(r² + 3t⌈(t+1)/2⌉)
  Element-wise product:   Winograd 4t²BN(c + αc′)CC′/(cc′) | Regular-FFT 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′) | Gauss-FFT 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
  Output transform:       Winograd 4BC′N(t² + m²) | Regular-FFT 4BC′N(2t⌈(t+1)/2⌉ + m²) | Gauss-FFT 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
  Input image transform:  Winograd N·F_I / (4x² + 4Nt²) | Regular-FFT N·F_I / (4x² + 8Nt⌈(t+1)/2⌉) | Gauss-FFT N·G_I / (4x² + 12Nt⌈(t+1)/2⌉)
  Kernel transform:       Winograd F_K / (4r² + 4t²) | Regular-FFT F_K / (4r² + 8t⌈(t+1)/2⌉) | Gauss-FFT G_K / (4r² + 12t⌈(t+1)/2⌉)
  Element-wise product:   Winograd cc′ / (2(c + αc′)) | Regular-FFT cc′ / (c + αc′) | Gauss-FFT cc′ / (2(c + αc′))
  Output transform:       Winograd F_O / (4t² + 4m²) | Regular-FFT F_O / (8t⌈(t+1)/2⌉ + 4m²) | Gauss-FFT G_O / (12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory, reading a total of 4CC′r² bytes, and store the CC′ transformed kernels back to main memory. The transformed kernels have size t × t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉, and 12t⌈(t + 1)/2⌉ bytes per kernel for the Winograd, Regular-FFT, and Gauss-FFT methods, respectively.

The total number of operations, data movements, and AIs for transforming all kernels of a particular layer using any of the three methods are given in Tbl. 2.

A.3 Element-wise Products

Having all transformed input and kernel tiles, the pre-transformed output tiles Ĩ′_{b,c′,n} are computed via

    Ĩ′_{b,c′,n} = Σ_{c=1}^{C} Ĩ_{b,c,n} ⊙ W̃_{c,c′}          (11)

Here all of the pre-transformed output tiles (Ĩ′_{b,c′,n}), transformed input tiles Ĩ_{b,c,n}, and transformed kernels W̃_{c,c′} have the same size t × t. Computing an element of a pre-transformed output tile at a location ℓ depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

    Ĩ′_{b,c′,n}(ℓ) = Σ_{c=1}^{C} Ĩ_{b,c,n}(ℓ) · W̃_{c,c′}(ℓ)          (12)

Note that the equation above can be interpreted as the multiplication of a matrix U_ℓ of size BN × C with a matrix V_ℓ of size C × C′, resulting in a matrix X_ℓ of size BN × C′. For the layers of modern ConvNets, the values of BN are much larger than C, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location of Ĩ′_{b,c′,n}.
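The reduction of Eqn. 12 to per-location matrix multiplications can be checked with a few lines of NumPy (an illustrative sketch with arbitrary small dimensions, not the benchmarked code).

    # Check that the channel accumulation of Eqn. 12 at each location (i, j) is a
    # (BN x C) by (C x C') matrix product.
    import numpy as np

    B, N, C, Cp, t = 2, 3, 4, 5, 6
    I_hat = np.random.rand(B * N, C, t, t)          # transformed input tiles
    W_hat = np.random.rand(C, Cp, t, t)             # transformed kernels

    # Direct evaluation of Eqn. 12: sum over input channels at every location.
    direct = np.einsum('ncij,cfij->nfij', I_hat, W_hat)

    # Per-location matrix products: U_l (BN x C) times V_l (C x C') for each (i, j).
    blocked = np.empty((B * N, Cp, t, t))
    for i in range(t):
        for j in range(t):
            blocked[:, :, i, j] = I_hat[:, :, i, j] @ W_hat[:, :, i, j]

    assert np.allclose(direct, blocked)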

A.3.1 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT-based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre-transformed output tiles need to be computed: as both the transformed input tiles and the transformed kernels are conjugate-symmetric, the pre-transformed output tiles will also be conjugate-symmetric.

In the Regular-FFT method, 4 real multiply-adds are required for performing a complex multiply-add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs for the whole layer. Gauss-FFT reduces each complex matrix multiplication to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs for the whole layer.

A.3.2 Data Movement

In the Winograd method, t² real matrix multiplications X = U × V are performed (for the rest of the section we omit the subscript ℓ for clarity). In the Regular-FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications of the same sizes are performed. The Gauss-FFT method replaces each complex matrix multiplication with three real matrix multiplications; thus a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U, and V may not fit in cache, in which case standard cache-blocking approaches are employed [14-16, 18, 22], which might require some data to be moved to and/or from main memory multiple times.

In the standard cache-blocking technique, V (of size C × C′) is subdivided into equally sized sub-matrices of size c × c′. To minimize transfers to and from main memory, a sub-matrix of V is kept in cache. A small sub-matrix of U, of size ρ × c, is fetched from memory, multiplied with the sub-matrix of V stored in-cache, and accumulated into the appropriate sub-matrix of X. Here ρ is a small number required for efficient in-cache computation [14-16, 18, 22].

This requires transferring a total of ρc numbers from main memory, plus ρc′ numbers of X from, and then back to, main memory: a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub-matrix multiplication produces a final result (no accumulation is necessary).

A total of BNCC′/(ρcc′) such sub-matrix multiplications is performed, requiring BN(c + αc′)CC′/(cc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications (X = U × V) are performed, thus transferring a total of 4t²BN(c + αc′)CC′/(cc′) bytes.

In the Regular-FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

Gauss-FFT replaces each of the t⌈(t + 1)/2⌉ complex matrix multiplications with 3 real matrix multiplications, so a total of 12t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes needs to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and c′ while allowing for in-cache computation. As the values of t, B, C, C′, and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

    minimize_{c, c′}   (c + αc′) / (cc′)
    subject to         c | C        (C is divisible by c)
                       c′ | C′      (C′ is divisible by c′)
                       4βcc′ ≤ (Cache Size) / 2    (fits in half the cache)          (13)

where α equals 1 when c = C and 2 when c < C, and β is 1 for real-valued matrices and 2 for complex-valued ones. Half the cache is allotted for the sub-matrices of V; this is typical practice, to ensure enough space for the sub-matrices of U and X.

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for the optimal values of c and c′. This results in AIs of cc′/(2(c + αc′)) for the Winograd and Gauss-FFT methods, and cc′/(c + αc′) for the Regular-FFT method, which are equal to the AIs of real-matrix and complex-matrix multiplications, respectively [14-16, 18].
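A brute-force solver for Eqn. 13 is straightforward; the sketch below (our own illustration, not the authors' implementation; the 512-channel, 1 MB-cache example is hypothetical) enumerates the divisors of C and C′ and keeps the feasible pair with the smallest data-movement factor.

    # Brute-force search over sub-block sizes c | C and c' | C' (Eqn. 13).
    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    def best_blocking(C, Cp, cache_bytes, complex_matrices=False):
        beta = 2 if complex_matrices else 1          # complex entries take twice the space
        best = None
        for c in divisors(C):
            for cp in divisors(Cp):
                if 4 * beta * c * cp > cache_bytes / 2:
                    continue                          # V sub-block must fit in half the cache
                alpha = 1 if c == C else 2            # no accumulation needed when c == C
                cost = (c + alpha * cp) / (c * cp)    # objective of Eqn. 13
                if best is None or cost < best[0]:
                    best = (cost, c, cp)
        return best

    # Example: 512 input/output channels and a 1 MB L2 slice, real vs. complex matrices.
    print(best_blocking(512, 512, 1024 * 1024, complex_matrices=False))
    print(best_blocking(512, 512, 1024 * 1024, complex_matrices=True))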

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels as a function of the cache size.

Figure 4. Arithmetic intensities for the element-wise stage given various amounts of cache, for C and C′ between 64 and 1024. In the Winograd and Gauss-FFT methods real matrix multiplications are performed, whereas in the Regular-FFT method complex matrix multiplications are performed.

The AIs of both complex matrix multiplication (used in the Regular-FFT method) and real matrix multiplication (used in Winograd and Gauss-FFT) increase with the cache size and the number of channels (C and C′). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element-wise stage of the Regular-FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss-FFT convolutions when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the BNC′ pre-transformed output tiles Ĩ′_{b,c′,n} is transformed from the Winograd/FFT domain back to the spatial domain via I′_{b,c′,n} = Aᵀ Ĩ′_{b,c′,n} A. The matrix A has size t × m (t = m + r − 1), resulting in I′_{b,c′,n} having size m × m. Here I′_{b,c′,n} is the n-th (non-overlapping) tile of the c′-th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m × m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements, and AIs for transforming the output tiles of a particular layer using any of the three methods are given in Tbl. 2.

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular-FFT method.

In Tbl. 7 and Tbl. 8 we show the lookup tables used in our model to determine the number of required operations

Table 3. FLOPS required for performing Winograd transforms of a single tile or kernel.

Method     | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out
F(2², r²)  | 12 5 5          | 32 49 42        | 180 198 154     | 240 231 168     | 742 728 504     | 704 645 430
F(3², r²)  | 32 24 28        | 180 128 128     | 240 170 153     | 742 552 460     | 704 518 407     | - - -
F(4², r²)  | 180 70 90       | 240 135 150     | 742 396 396     | 704 403 372     | - - -           | - - -
F(5², r²)  | 240 72 99       | 742 260 312     | 704 300 325     | - - -           | - - -           | - - -
F(6², r²)  | 742 144 208     | 704 242 308     | - - -           | - - -           | - - -           | - - -
F(7², r²)  | 704 130 195     | - - -           | - - -           | - - -           | - - -           | - - -

Table 4. AIs of Winograd transforms for a single tile or kernel.

Method     | r=2: In Ker Out  | r=3: In Ker Out  | r=4: In Ker Out  | r=5: In Ker Out  | r=6: In Ker Out  | r=7: In Ker Out
F(2², r²)  | 0.17 0.10 0.10   | 0.25 0.49 0.53   | 0.90 1.21 1.33   | 0.83 0.95 1.05   | 1.89 2.14 2.38   | 1.38 1.43 1.58
F(3², r²)  | 0.25 0.30 0.28   | 0.90 0.94 0.94   | 0.83 0.82 0.85   | 1.89 1.86 1.98   | 1.38 1.29 1.39   | - - -
F(4², r²)  | 0.90 0.60 0.55   | 0.83 0.75 0.72   | 1.89 1.52 1.52   | 1.38 1.13 1.16   | - - -            | - - -
F(5², r²)  | 0.83 0.45 0.41   | 1.89 1.12 1.05   | 1.38 0.94 0.91   | - - -            | - - -            | - - -
F(6², r²)  | 1.89 0.68 0.61   | 1.38 0.83 0.77   | - - -            | - - -            | - - -            | - - -
F(7², r²)  | 1.38 0.48 0.43   | - - -            | - - -            | - - -            | - - -            | - - -

Table 5. FLOPS required for performing Regular-FFT transforms of a single tile or kernel.

Method | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -


Table 6. AIs of Regular-FFT transforms for a single tile or kernel.

Method | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss-FFT method.

C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss-FFT vs. the Regular-FFT method.

D Comparing to State-of-the-art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and with other state-of-the-art libraries. The benchmarks were performed on AVX512-capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2.

Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.


Table 7. FLOPS required for performing Gauss-FFT transforms of a single tile or kernel.

Method | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

Figure 5. Theoretical estimates of our model and empirical measurements for the speedup of the Gauss-FFT vs. the Regular-FFT method on VGG and AlexNet (panels for VGG 1.2-5.1/5.2 and AlexNet 2-5; the x axis is the system's CMR).


Table 8. AIs of Gauss-FFT transforms for a single tile or kernel.

Method | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -


Figure 6. Running time (in milliseconds) of our implementations (Winograd, Regular-FFT, Gauss-FFT) compared to state-of-the-art (MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd) on AVX512-capable systems, over AlexNet layers 2-5 and VGG layers 1.2-5.1/5.2; panels include the Xeon Phi 7210 (409.6 GB/s, CMR of 11), the i9-7900X (96 GB/s, CMR of 22), and the i9-7900X (68.3 GB/s, CMR of 31).

283

83

155

28216

12124

36

906

9

146

12

0

100

200

300

VGG 12

102

92

694

5

829

5

519

2383

1

568

7

0

30

60

90

VGG 21

178

06

138

54

104

85683

1

528

2

745

0

0

50

100

150

VGG 22

102

36

442

4

372

2

293

4

249

6

291

5

0

30

60

90

VGG 31

121

72

875

5

675

7446

6

445

5

419

1

0

50

100

VGG 32

933

3

420

0

272

2

248

6

241

8

193

8

0

25

50

75

100VGG 41

145

85

833

4509

5

452

8

444

4

354

0

0

50

100

150

VGG 42

647

3

232

8

185

5

149

1

140

7

124

4

0

20

40

60

VGG 5152

na19

95

na

122

4

852

359

8

0

10

20

30

AlexNet 2

226

6

116

1

130

2873

717815

0

5

10

15

20

AlexNet 3

395

5

162

2

193

6

116

3

106

5

106

7

0

10

20

30

40

AlexNet 4

246

2

108

7

129

0779

725

729

0

10

20

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 (using 48 cores) 1024 GBs CMR of 33T

ime

(mill

isec

onds

)

297

93

146

66

206

10126

13

944

4

146

57

0

100

200

300

VGG 12

116

19

638

2

718

8526

0

393

8

575

9

0

25

50

75

100

125VGG 21

215

60

129

62

107

95711

7

524

6

770

6

0

50

100

150

200

VGG 22

105

16

559

3391

3

308

3

240

7

294

5

0

30

60

90

VGG 31

106

90

117

46

976

6

438

2

425

6

398

1

0

40

80

120

VGG 32

104

13

334

4

435

3

226

1

209

6

174

0

0

30

60

90

VGG 41

142

18

667

2

740

1415

0

388

2

308

6

0

50

100

150

VGG 42

798

9

186

1

167

3

139

2

124

0

109

3

0

20

40

60

80

VGG 5152n

a

156

1

na

122

7

865

363

6

0

10

20

30

40AlexNet 2

287

0

95214

12

852

661800

0

10

20

30

AlexNet 3

333

1

130

4

229

1

106

1

957

927

0

10

20

30

AlexNet 4

299

9

87612

33

756

670

689

0

10

20

30

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 1024 GBs CMR of 3911

Tim

e (m

illis

econ

ds)

403

59

361

04

291

20

254

04182

45

292

40

0

100

200

300

400

VGG 12

114

93

136

57

121

77

103

81

766

8

113

60

0

50

100

150VGG 21

187

07281

99

238

27

132

57

106

41

148

55

0

100

200

300VGG 22

638

0

142

72

949

7

592

7

493

5

590

1

0

50

100

150

VGG 31

105

34

263

79

188

49

856

3

830

4

820

4

0

100

200

VGG 32

491

5

134

24

896

6

484

0

458

3

379

9

0

50

100

VGG 41

887

1

248

79

170

67865

1

849

7

695

1

0

100

200

VGG 42

343

9

713

3

444

4

261

4

263

0

250

7

0

20

40

60

VGG 5152

na

186

41

na24

94

176

1685

8

0

50

100

150

200AlexNet 2

188

0700

4

640

6

160

5

133

8

163

3

0

20

40

60

AlexNet 326

32

102

98

658

6

207

5

183

8

201

5

0

30

60

90

AlexNet 4

200

6691

4

438

8

148

7

124

2

144

3

0

20

40

60

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 512 GBs CMR of 4125

Tim

e (m

illis

econ

ds)

Figure 7 Running time of our implementations compared to statendashofndashthendashart on AVX512 capable systems continued

17


Figure 1. Convolution layers' running time on Xeon Gold CPU (bars: Regular–FFT, Gauss–FFT, Winograd (Jia et al.), MKL-DNN Winograd, MKL-DNN Direct, and LIBXSMM Winograd; one panel per VGG and AlexNet layer; times in milliseconds).

Figure 2. Normalized running time of different convolution implementations (the systems with the same CPUs are distinguished by their CMR).

Relative performances. While an FFT–based method outperformed the Winograd–based one more often than not, the relative performances varied among different layers and systems. In the rest of the paper we focus on the theoretical analysis of all the methods. We would like to understand why our findings are not aligned with the popular belief that the Winograd method should be strictly faster.

5 Performance Analysis

Our experimental observations suggested that some of the stages in both the Winograd– and FFT–based approaches have relatively low utilization of the system's available FLOPS. In most, but not all, cases these were the transform stages, which have a relatively small amount of compute but access a relatively large amount of data, suggesting that their running time is bound by the memory bandwidth and not the computational capabilities of the CPU.

For this reason we used the Roofline [33] performance model to analyze the Winograd– and FFT–based convolutions.

Roofline Performance Model is an ISA–oblivious, throughput–oriented performance model used to estimate the performance of an application running on a processing unit; it is often used to predict performance on CPUs, GPUs, TPUs (Tensor Processing Units), etc.

It is suitable for analyzing particular methods on systems where the memory bandwidth often becomes the constraining resource. It estimates the performance bound of an algorithm as a function of its arithmetic intensity (AI), which is defined as the ratio of the total floating–point operations (FPO) and the total data movement (DM) in bytes (AI = FPO/DM) between different levels of the memory hierarchy. For each level of the memory hierarchy, the performance ceiling (attainable FLOPS) is determined by

Attainable FLOPS = min(Peak FLOPS, MB × AI)        (7)

Here Peak FLOPS is defined as the maximal number of floating point operations per second of a given system, and MB as the system's peak memory bandwidth. When plotted, the performance ceiling line resembles a roofline.

We are interested in the DM between the highest level of on–chip core–exclusive cache (typically L2 for CPUs) and off–chip shared memory³. In systems where the L2 is shared among a small number of cores (e.g. the Intel Xeon Phi series shares the L2 cache between two cores), the L2 is assumed to be equally divided for exclusive usage among the cores.

DM then accounts for all regular and streaming stores to main memory, regular loads from main memory, and pre–fetches from main memory to either L1 or L2.

Our analysis does not consider the presence of higher–level shared caches, as on modern processors the sizes of such caches are very limited and cannot fit the working set of a typical convolutional layer.

³ The term off–chip refers to the main memory of the system, typically large but with much lower throughput and higher latency than the on–chip caches. The HBW MCDRAM of the Knights Landing processor would be considered off–chip.

Additional performance ceilings for lower levels of cache are not necessary, since in both Winograd–based and FFT–based convolutional algorithms the computation is either bounded by transfers to and from main memory, or by the processor's peak FLOPS, as shown in the following sections.

As the performance ceiling is set by the algorithm's arithmetic intensity (Eqn. 7), its running time can be estimated by

running time = FPO / Attainable FLOPS
             = FPO / min(Peak FLOPS, MB × AI)
             = (FPO / MB) / min(CMR, AI)
             = FPO / Peak FLOPS   if CMR ≤ AI
             = DM / MB            if CMR > AI        (8)

Here CMR is the system's compute–to–memory ratio, defined as the ratio of its Peak FLOPS and MB (the memory bandwidth); FPO represents the total number of floating point operations required, and DM the total amount of data movement in bytes, as defined above. The running time is compute bound when CMR ≤ AI, in which case it can be computed as FPO / Peak FLOPS; otherwise the running time is memory bound and can be computed as DM / MB.

Estimating running time. For the Winograd– and FFT–based convolutional layers, where the computation is composed of several sequential stages S, each with a unique arithmetic intensity, the running time is calculated by accumulating the running times of each stage s ∈ S:

running time_Total = Σ_{s ∈ S} running time_s = Σ_{s ∈ S} (FPO_s / MB) / min(CMR, AI_s)        (9)
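To make the model concrete, the following minimal Python sketch evaluates Eqns. 7–9 for a sequence of stages. It is only an illustration: the per-stage FLOP and byte counts are arbitrary placeholders, not measurements from our benchmarks.

```python
def attainable_flops(peak_flops, mem_bw, ai):
    # Eqn. 7: the roofline ceiling for a stage with arithmetic intensity `ai`.
    return min(peak_flops, mem_bw * ai)

def stage_time(fpo, dm, peak_flops, mem_bw):
    # Eqn. 8: FPO / Attainable FLOPS, with AI = FPO / DM.
    return fpo / attainable_flops(peak_flops, mem_bw, fpo / dm)

def total_time(stages, peak_flops, mem_bw):
    # Eqn. 9: the layer time is the sum of the per-stage estimates.
    return sum(stage_time(fpo, dm, peak_flops, mem_bw) for fpo, dm in stages)

# Hypothetical (FLOPs, bytes moved) per stage on a 3 TFLOP/s, 100 GB/s system (CMR = 30).
stages = [(2.0e9, 4.0e8), (5.0e10, 1.0e9), (1.5e9, 6.0e8)]
print(total_time(stages, 3.0e12, 100.0e9))   # estimated seconds
```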

5.1 Relative Performances

We are interested in the relative performance between the Winograd and FFT methods. We define Speedup(A, B) as the ratio of the running times of an algorithm B and an algorithm A:

Speedup(A, B) = running time_Total(B) / running time_Total(A)        (10)

A speedup greater than one indicates that algorithm A is faster, and smaller than one means that algorithm B is faster. Eqn. 9 estimates the running time of an algorithm assuming perfect utilization of the hardware, which is rarely achievable in practice. However, Eqn. 10 remains valid when the compared algorithms are equally optimized, meaning that they utilize the same percentage of the hardware capabilities.

In Eqn. 10, the value of AI will differ between the Winograd– and FFT–based approaches, and will also depend on the amount of available cache [18]. Therefore, when comparing the performance of two algorithms, the relative performance on a given system will depend on the CMR ratio and the amount of available cache, but not on the absolute values of the system's compute speed or memory throughput.

A detailed analysis of obtaining the values of AI, DM, and FPO is presented in Appendix A.

Using Eqn. 10 and the values of AI, DM, and FPO calculated in Appendix A (Tbl. 2), we estimate the speedup between the FFT–based methods and the Winograd–based one. For all three methods, the tile sizes are chosen in a way that minimizes the total running time (Eqn. 9).
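Because Eqn. 8 can be divided through by the memory bandwidth, the estimated speedup of one method over another depends only on the system's CMR and the per-stage AIs. A small sketch, again with made-up stage costs, illustrates this dependence:

```python
def total_time_units(stages, cmr):
    # Eqn. 9 multiplied by MB on both sides: only the CMR and the AIs matter.
    return sum(fpo / min(cmr, fpo / dm) for fpo, dm in stages)

# Made-up per-stage (FLOPs, bytes moved) for two hypothetical methods.
fft_stages      = [(1.0e9, 6.0e8), (4.0e10, 9.0e8), (1.2e9, 7.0e8)]
winograd_stages = [(0.6e9, 5.0e8), (3.0e10, 1.5e9), (0.8e9, 6.0e8)]

for cmr in (10, 20, 40, 60):
    s = total_time_units(winograd_stages, cmr) / total_time_units(fft_stages, cmr)
    print(f"CMR={cmr:>2}: estimated FFT-over-Winograd speedup = {s:.2f}")
```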

5.2 The Accuracy of the Theoretical Estimates

In Fig. 3 we plot the estimated theoretical speedup of the two FFT–based methods over the Winograd–based one. The solid lines represent the theoretical estimates of the relative performances as a function of the system's CMR. The color of the line represents the amount of available L2 cache. The lines are drawn in a semi–translucent fashion, as they overlap in many places.

The empirical results from the benchmarks described in Section 4 are overlaid: each cross–hair represents the measurement of the relative performances on a single system. The x coordinate of the cross–hairs represents the system's compute–to–memory (CMR) ratio (see Tbl. 1), and the color represents the L2 cache sizes.

Our empirical results are consistent with our theoretical estimates. The overall relative root mean square error (rRMSE) was 0.079 for Regular–FFT vs Winograd and 0.1 for Gauss–FFT vs Winograd. The fitness⁴ was 92.68% for Regular–FFT vs Winograd and 90% for Gauss–FFT vs Winograd.
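For reference, a small sketch of the two agreement metrics, under the assumption that the rRMSE is the root mean square of the relative errors between predicted and measured speedups (the array names and values are illustrative only):

```python
import numpy as np

def rrmse(predicted, measured):
    # Assumed definition: root mean square of the relative errors.
    p, m = np.asarray(predicted, float), np.asarray(measured, float)
    return float(np.sqrt(np.mean(((p - m) / m) ** 2)))

def fitness(predicted, measured):
    # Footnote 4: fitness = 100 / (1 + rRMSE), expressed in percent.
    return 100.0 / (1.0 + rrmse(predicted, measured))

predicted = [1.10, 0.95, 1.30]   # illustrative model estimates
measured  = [1.05, 1.00, 1.25]   # illustrative benchmark results
print(rrmse(predicted, measured), fitness(predicted, measured))
```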

5.3 Discussions

Hardware utilization. We have also measured the system utilization of each stage of the three methods. While the utilization varied across benchmarked layers and systems, on average 75% of the theoretical peak FLOPS were attained during the compute bound stages, and slightly more than 85% of the theoretical memory bandwidth was achieved in the memory bound stages. This results in a somewhat lower effective CMR, which is consistent with the results shown in Fig. 3: the empirical results are slightly shifted to the left (towards lower values of CMR) when compared to the theoretical predictions, which assume equal utilization of both the FLOPS and the memory bandwidth.

Optimality of Tile Transforms. Both "wincnn" [7] and "genfft" [13] use heuristics to minimize the number of operations for performing transforms, and might not be optimal. However, as their AIs are much smaller than the CMRs of modern systems, the computation is memory bound, and the running time of the transform stages will depend solely on the amount of data movement. The largest AI of the FFT transforms is 5.55, and for Winograd 2.38, much lower than the CMRs of modern CPUs.

⁴ fitness = 100 / (1 + rRMSE)


Figure 3. Theoretical estimates and empirical measurements for the speedup of the Regular– and Gauss–FFT methods over the Winograd method on VGG and AlexNet. The x axis represents the system's CMR (Compute to Memory Ratio, FLOPs/Byte); solid lines show the theoretical estimates, cross–hairs show the empirical measurements, and colors denote L2 cache sizes of 256 KB, 512 KB, and 1024 KB.

The Xeon Phi Knights Landing processor has a CMR of 11 (due to its on–chip fast MCDRAM), and the Xeon Server processor family has CMRs in the range of 20–40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were provided.

This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. Using such sizes, the images could be divided into overlapping tiles with minimal overhead (minimal padding of the original image), which reduces both the number of required operations and the amount of required data movement.

6 Conclusions

This paper presents experimental and analytical findings that FFT–based convolutions are, on average, faster than Winograd–based convolutions on modern CPUs.

In contrast to the popular belief that the Winograd–based method is more efficient, our experimental results with a highly optimized Winograd–based implementation and two similarly optimized FFT–based implementations on modern CPUs show that the FFT convolutions are, more often than not, faster than Winograd ones for the most commonly used ConvNet layers.

Our analysis using the Roofline model shows that whether the Winograd– or FFT–based approach is faster depends on both the layer and the target hardware. However, with the tendency of increasing compute–to–memory ratios of systems with modern CPUs, the FFT–based methods tend to be faster than Winograd.

While we primarily focused on modern multi– and many–core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can be applied to GPUs. Future work might include implementing and benchmarking efficient GPU–based implementations and validating the performance analysis based on the Roofline model.

References

[1] 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library (2016).
[2] 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn (2016).
[3] 2018. N-Dimensional Winograd-based convolution framework. https://bitbucket.org/poozh/ond-winograd (2018).
[4] Accessed 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks.
[5] Accessed 08-15-2018. Intel Nervana reference deep learning framework. https://github.com/NervanaSystems/neon.
[6] Accessed 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm.
[7] Accessed 08-15-2018. Wincnn. https://github.com/andravin/wincnn.
[8] Accessed 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master.
[9] Jeremy Appleyard. Accessed 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5.
[10] David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR 2009). IEEE, 248–255.
[13] Matteo Frigo and Steven G. Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 3. IEEE, 1381–1384.
[14] Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 229–254.
[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis (SC16). IEEE, 981–991.
[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[19] Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455–500.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
[23] M. Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306–308.
[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).
[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2924–2932.
[26] Victor Y. Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676–694.
[27] Lawrence R. Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall.
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The Roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24–26.
[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. SIAM.
[35] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20–24.
[36] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium (IPDPS 2016). IEEE, 801–811.
[37] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis (SC16). IEEE, 854–865.
[38] Aleksandar Zlateski and H. Sebastian Seung. 2017. Compile-time optimized and statically scheduled ND convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity we assume 2D and isotropic images and kernels, as well as computation being performed with 32–bit floating point numbers (4 bytes per number). Extension to non–isotropic and N–dimensional images and kernels, as well as lower precision floats, is straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non–isotropic kernels.

We follow the same notation as in Section 2 and, for an arbitrary layer, use Winograd convolution F(m², r²), Regular–FFT convolution 𝔽(m², r²), and Gauss–FFT convolution G(m², r²). Here m can take an arbitrary value, whereas r² has to equal the sizes of the kernels in the layer.

We proceed to present the details of all three methods, and estimate the total number of operations and data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A.1 Input Transform Stage

In Winograd and both FFT approaches, each of the B · C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has the size of t², with t = (m + r − 1), and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions. Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image will be implicitly zero–padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.
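As a worked example of this tiling arithmetic, with assumed layer dimensions:

```python
import math

x, r, m = 56, 3, 6                       # assumed image size, kernel size, output tile size
t = m + r - 1                            # transformed tile size (t x t)
N = math.ceil((x - r + 1) / m) ** 2      # tiles per image: ceil((x - r + 1)/m)^2
B, C = 64, 256                           # assumed batch size and input channels
print(t, N, B * C * N)                   # 8, 81, and the total number of image tiles
```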

Number of required operations. Each tile (I_bcn) is transformed via Ĩ_bcn = B I_bcn Bᵀ, which is a composition of two matrix–matrix multiplications. Here I_bcn is the n–th tile of the c–th channel in batch b, and Ĩ_bcn is its transform. Both B and I_bcn have the size t², requiring a total of 4t³ operations (2t³ per matrix–matrix multiplication). Operations are on real numbers for the Winograd approach, and on complex numbers for the FFT approaches.

The transforms can, however, be performed with much fewer operations. In the Winograd method the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real–to–complex DFT transforms are performed, which require much fewer operations (μn² log n instead of O(n³)). Here the constant μ can vary significantly based on the size of the transform performed [13]; it takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.
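The two flavors of tile transform can be sketched as follows; B_mat is a random stand-in for an actual Winograd transform matrix (not a real one), and the FFT path simply uses NumPy's real-to-complex FFT:

```python
import numpy as np

t = 8
tile = np.random.rand(t, t).astype(np.float32)

B_mat = np.random.rand(t, t).astype(np.float32)   # stand-in for the t x t matrix B
winograd_like = B_mat @ tile @ B_mat.T            # I~ = B I B^T (4t^3 naive FLOPs)

fft_tile = np.fft.rfft2(tile)                     # real-to-complex 2D DFT
print(winograd_like.shape, fft_tile.shape)        # (8, 8) and (8, 5): conjugate symmetry
```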

Instead of using a closed–form expression that gives bounds on the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd, we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods, we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as F_I(m², r²) for Winograd, 𝔽_I(m², r²) for Regular–FFT, and G_I(m², r²) for Gauss–FFT. The pre–computed lookup tables are included in our open–sourced repository.

The total number of operations required for performing all the transforms for all three methods is given in Tbl. 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in–cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx² · 4 bytes, reading each of the B · C images of size x² only once. Each of the BCN transformed tiles is stored back to main memory. The size of each transformed tile will depend on the method.

In Winograd convolution, transformed tiles are real tensors and require a single 32–bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms will be complex conjugate–symmetric (along one of the dimensions); only t⌈(t + 1)/2⌉ complex numbers need to be stored. The Regular–FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and the Gauss–FFT requires three reals per complex number, for a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.
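A small helper makes these per-tile storage sizes explicit (4 bytes per real number; t⌈(t+1)/2⌉ stored complex values for the FFT methods). The function name and the example value of t are illustrative:

```python
def transformed_tile_bytes(t):
    stored = t * ((t + 2) // 2)        # t * ceil((t+1)/2) complex values kept
    return {
        "winograd": 4 * t * t,         # real tile: one 4-byte float per element
        "regular_fft": 8 * stored,     # two reals per stored complex value
        "gauss_fft": 12 * stored,      # three reals per stored complex value
    }

print(transformed_tile_bytes(8))       # {'winograd': 256, 'regular_fft': 320, 'gauss_fft': 480}
```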

A.2 Kernel Transform Stage

In all methods, each of the C×C′ kernels (with size r×r) is transformed via W̃_cc′ = G W_cc′ Gᵀ. The matrix G has size t×r.

Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero–padded to match the size of the input tiles. In practice, the transforms are computed with implicit zero–padding. As in the input stage, we have pre–computed the number of operations required for transforming a kernel in all three methods: F_K(m², r²) for Winograd and 𝔽_K(m², r²) for Regular–FFT. For the Gauss–FFT method, G_K(m², r²) = 𝔽_K(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.
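For completeness, a sketch of the three-multiplication complex product that Sec. 2.3 refers to, which is the source of the two extra operations per complex number counted above (this is the textbook Gauss trick, not code from our implementation):

```python
def gauss_mult(ar, ai, br, bi):
    # Complex product (ar + i*ai) * (br + i*bi) with three real multiplications.
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    return k1 - k3, k1 + k2            # (real part, imaginary part)

print(gauss_mult(1.0, 2.0, 3.0, 4.0))  # (-5.0, 10.0) == (1+2j)*(3+4j)
```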


Table 2. FLOPS, DM, and AI for different stages of the Winograd (F(m², r²)), Regular–FFT (𝔽(m², r²)), and Gauss–FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

FLOPS
  Input image transform:  Winograd BCN F_I(m², r²);  Regular–FFT BCN 𝔽_I(m², r²);  Gauss–FFT BCN G_I(m², r²)
  Kernel transform:       Winograd CC′ F_K(m², r²);  Regular–FFT CC′ 𝔽_K(m², r²);  Gauss–FFT CC′ G_K(m², r²)
  Element–wise product:   Winograd 2t²BNCC′;  Regular–FFT 8t⌈(t+1)/2⌉BNCC′;  Gauss–FFT 6t⌈(t+1)/2⌉BNCC′
  Output transform:       Winograd BC′N F_O(m², r²);  Regular–FFT BC′N 𝔽_O(m², r²);  Gauss–FFT BC′N G_O(m², r²)

DM
  Input image transform:  Winograd 4BCx² + 4BCNt²;  Regular–FFT 4BCx² + 8BCNt⌈(t+1)/2⌉;  Gauss–FFT 4BCx² + 12BCNt⌈(t+1)/2⌉
  Kernel transform:       Winograd 4CC′(r² + t²);  Regular–FFT 4CC′(r² + 2t⌈(t+1)/2⌉);  Gauss–FFT 4CC′(r² + 3t⌈(t+1)/2⌉)
  Element–wise product:   Winograd 4t²BN(c + αc′)CC′/(cc′);  Regular–FFT 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′);  Gauss–FFT 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
  Output transform:       Winograd 4BC′N(t² + m²);  Regular–FFT 4BC′N(2t⌈(t+1)/2⌉ + m²);  Gauss–FFT 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
  Input image transform:  Winograd N F_I(m², r²) / (4x² + 4Nt²);  Regular–FFT N 𝔽_I(m², r²) / (4x² + 8Nt⌈(t+1)/2⌉);  Gauss–FFT N G_I(m², r²) / (4x² + 12Nt⌈(t+1)/2⌉)
  Kernel transform:       Winograd F_K(m², r²) / (4r² + 4t²);  Regular–FFT 𝔽_K(m², r²) / (4r² + 8t⌈(t+1)/2⌉);  Gauss–FFT G_K(m², r²) / (4r² + 12t⌈(t+1)/2⌉)
  Element–wise product:   Winograd cc′ / (2(c + αc′));  Regular–FFT cc′ / (c + αc′);  Gauss–FFT cc′ / (2(c + αc′))
  Output transform:       Winograd F_O(m², r²) / (4t² + 4m²);  Regular–FFT 𝔽_O(m², r²) / (8t⌈(t+1)/2⌉ + 4m²);  Gauss–FFT G_O(m², r²) / (12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory, reading a total of 4CC′r² bytes, and store the CC′ transformed kernels back to main memory. The transformed kernels have size t × t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉, and 12t⌈(t + 1)/2⌉ bytes per kernel for the Winograd, Regular–FFT, and Gauss–FFT methods, respectively.

The total number of operations, data movements, and AIs for transforming all kernels of a particular layer using any of the three methods are given in Tbl. 2.

A.3 Element–wise Products

Having all transformed input and kernel tiles, the pre–transformed output tiles Ĩ′_bc′n are computed via

Ĩ′_bc′n = Σ_{c=1}^{C} Ĩ_bcn ⊙ W̃_cc′        (11)

Here all of the pre–transformed output tiles (Ĩ′_bc′n), transformed input tiles Ĩ_bcn, and transformed kernels W̃_cc′ have the same size t × t. Computing an element of the pre–transformed output tile at a location ℓ depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

Ĩ′_bc′n(ℓ) = Σ_{c=1}^{C} Ĩ_bcn(ℓ) · W̃_cc′(ℓ)        (12)

Note that the equation above can be interpreted as a multiplication of a matrix U_ℓ of size BN × C with a matrix V_ℓ of size C × C′, resulting in a matrix X_ℓ of size BN × C′. For layers of modern ConvNets, the values of BN are much larger than C, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location ℓ of Ĩ′_bc′n.
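A compact way to see this structure is to view the element–wise stage as one (BN × C) by (C × C′) product per tile location, e.g. with einsum; all shapes below are illustrative only:

```python
import numpy as np

B, N, C, Cp, t = 2, 9, 4, 8, 8
inp = np.random.rand(B * N, C, t, t)           # transformed input tiles
ker = np.random.rand(C, Cp, t, t)              # transformed kernels

out = np.einsum('ncxy,ckxy->nkxy', inp, ker)   # one matmul per location (x, y)
print(out.shape)                               # (B*N, C', t, t)
```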

A.3.1 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT–based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre–transformed output tiles need to be computed: as both the transformed input tiles and transformed kernels are conjugate symmetric, the pre–transformed output tiles will also be conjugate symmetric.

In the Regular–FFT method, 4 real multiply–adds are required for performing a complex multiply–add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer. Gauss–FFT reduces complex matrix multiplications to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer.

A.3.2 Data Movement

In the Winograd method, t² real matrix multiplications X = U × V are performed (for the rest of the section we will omit the subscript ℓ for clarity). In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications of the same sizes are performed. The Gauss–FFT method replaces one complex matrix multiplication with three real matrix multiplications; thus a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U, and V may not fit in cache, in which case standard cache–blocking approaches are employed [14–16, 18, 22], and might require some data to be moved to and/or from the main memory multiple times.

In the standard cache–blocking technique, V (of size C×C′) is subdivided into equally sized sub–matrices of size c × c′. To minimize transfers to and from main memory, a sub–matrix of V is kept in cache. A small sub–matrix of U, with size ρ×c, is fetched from memory, multiplied with the sub–matrix of V stored in–cache, and accumulated to the appropriate sub–matrix of X. Here ρ is a small number required for efficient in–cache computation [14–16, 18, 22].

This requires transferring a total of ρc numbers of U from main memory, and ρc′ numbers of X from, and then back to, the main memory: a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub–matrix multiplication produces the final result (no accumulation is necessary).

A total of BNCC′/(ρcc′) such sub–matrix multiplications is performed, requiring BN(c + αc′)CC′/(cc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications (X = U × V) are performed, thus transferring a total of 4t²BN(c + αc′)CC′/(cc′) bytes.

In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

The Gauss–FFT replaces each of the t⌈(t + 1)/2⌉ complex matrix multiplications with 3 real matrix multiplications; a total of 12t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes needs to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and c′ while allowing for in–cache computation. As the values of t, B, C, C′, and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

minimize_{c, c′}   (c + αc′) / (cc′)
subject to         c | C      (C is divisible by c)
                   c′ | C′    (C′ is divisible by c′)
                   4βcc′ ≤ (Cache Size)/2    (fits in half the cache)        (13)

Here α equals 1 when c = C, and 2 when c < C; β is 1 for real–valued matrices and 2 for complex–valued ones. Half the cache is allowed for the sub–matrices of V; this is a typical practice to ensure enough space for the sub–matrices of U and X.
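A brute-force search over the (small) divisor sets makes the choice of c and c′ concrete. The sketch below implements Eqn. 13 directly and is not the blocking heuristic of any particular library; the example layer and cache sizes are assumptions:

```python
def best_blocking(C, Cp, cache_bytes, beta):
    # Minimize (c + alpha*c')/(c*c') subject to a c x c' block of V
    # (4*beta bytes per entry) fitting in half of the cache.
    divisors = lambda n: [d for d in range(1, n + 1) if n % d == 0]
    best = None
    for c in divisors(C):
        for cp in divisors(Cp):
            if 4 * beta * c * cp > cache_bytes // 2:
                continue
            alpha = 1 if c == C else 2
            cost = (c + alpha * cp) / (c * cp)     # the Eqn. 13 objective
            if best is None or cost < best[0]:
                best = (cost, c, cp)
    return best

print(best_blocking(256, 512, 512 * 1024, beta=1))   # real matrices, 512 KB L2
```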

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for the optimal values of c and c′. This results in AIs of cc′/(2(c + αc′)) for the Winograd and Gauss–FFT methods, and cc′/(c + αc′) for the Regular–FFT method, which are equal to the AIs of real–matrix and complex–matrix multiplications, respectively [14–16, 18].

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels as a function of the cache size.

Figure 4. Arithmetic intensities for the element–wise stage given various amounts of cache (x axis: cache size in KB; y axis: arithmetic intensity; C and C′ ∈ {64, 128, 256, 512, 1024}; panels: real matrices and complex matrices). In the Winograd and Gauss–FFT methods real matrix multiplications are performed, whereas in the Regular–FFT method complex matrix multiplications are performed.

The AIs of both complex matrix multiplication (used in the Regular–FFT method) and real matrix multiplication (used in Winograd and Gauss–FFT) increase with the cache size and the number of channels (C and C′). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element–wise stage of the Regular–FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss–FFT convolution when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the BNC′ pre–transformed output tiles Ĩ′_bc′n is transformed from the Winograd/FFT domain back to the spatial domain via I′_bc′n = Aᵀ Ĩ′_bc′n A. The matrix A has size t × m (t = m + r − 1), resulting in I′_bc′n having a size of m × m. Here I′_bc′n is the n–th (non–overlapping) tile of the c′–th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m×m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements, and AIs for transforming the output tiles of a particular layer using any of the three methods are given in Tbl. 2.
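The output transform itself mirrors the input-transform sketch given earlier; here A is a random stand-in for the actual t × m inverse-transform matrix:

```python
import numpy as np

m, r = 6, 3
t = m + r - 1
A = np.random.rand(t, m)                # placeholder for the t x m matrix A
pre = np.random.rand(t, t)              # one pre-transformed output tile
out_tile = A.T @ pre @ A                # I' = A^T I~' A
print(out_tile.shape)                   # (6, 6): one non-overlapping output tile
```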

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular–FFT method.

In Tbl. 7 and Tbl. 8 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss–FFT method.

Table 3. FLOPS required for performing Winograd transforms of a single input tile (In), kernel (Ker), or output tile (Out).

Method      | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out
F(2², r²)   | 12 5 5          | 32 49 42        | 180 198 154     | 240 231 168     | 742 728 504     | 704 645 430
F(3², r²)   | 32 24 28        | 180 128 128     | 240 170 153     | 742 552 460     | 704 518 407     | - - -
F(4², r²)   | 180 70 90       | 240 135 150     | 742 396 396     | 704 403 372     | - - -           | - - -
F(5², r²)   | 240 72 99       | 742 260 312     | 704 300 325     | - - -           | - - -           | - - -
F(6², r²)   | 742 144 208     | 704 242 308     | - - -           | - - -           | - - -           | - - -
F(7², r²)   | 704 130 195     | - - -           | - - -           | - - -           | - - -           | - - -

Table 4. AIs of Winograd transforms for a single input tile (In), kernel (Ker), or output tile (Out).

Method      | r=2: In Ker Out  | r=3: In Ker Out  | r=4: In Ker Out  | r=5: In Ker Out  | r=6: In Ker Out  | r=7: In Ker Out
F(2², r²)   | 0.17 0.10 0.10   | 0.25 0.49 0.53   | 0.90 1.21 1.33   | 0.83 0.95 1.05   | 1.89 2.14 2.38   | 1.38 1.43 1.58
F(3², r²)   | 0.25 0.30 0.28   | 0.90 0.94 0.94   | 0.83 0.82 0.85   | 1.89 1.86 1.98   | 1.38 1.29 1.39   | - - -
F(4², r²)   | 0.90 0.60 0.55   | 0.83 0.75 0.72   | 1.89 1.52 1.52   | 1.38 1.13 1.16   | - - -            | - - -
F(5², r²)   | 0.83 0.45 0.41   | 1.89 1.12 1.05   | 1.38 0.94 0.91   | - - -            | - - -            | - - -
F(6², r²)   | 1.89 0.68 0.61   | 1.38 0.83 0.77   | - - -            | - - -            | - - -            | - - -
F(7², r²)   | 1.38 0.48 0.43   | - - -            | - - -            | - - -            | - - -            | - - -

Table 5. FLOPS required for performing Regular–FFT transforms of a single input tile (In), kernel (Ker), or output tile (Out). Each row below lists, for its method F(m², r²), the In/Ker/Out triples for r = 2, 3, 4, 5, 6, and 7, in that order ("-" marks unsupported combinations).

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -


Table 6. AIs of Regular–FFT transforms of a single input tile (In), kernel (Ker), or output tile (Out). Each row below lists, for its method F(m², r²), the In/Ker/Out triples for r = 2, 3, 4, 5, 6, and 7, in that order; values are given in hundredths (e.g. 064 denotes 0.64, 111 denotes 1.11).

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -


C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss–FFT vs the Regular–FFT method.

D Comparing to State–of–the–art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and other state–of–the–art libraries. The benchmarks were performed on AVX512–capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.


Table 7. FLOPS required for performing Gauss–FFT transforms of a single input tile (In), kernel (Ker), or output tile (Out). Each row below lists, for its method G(m², r²), the In/Ker/Out triples for r = 2, 3, 4, 5, 6, and 7, in that order ("-" marks unsupported combinations).

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

Figure 5. Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs the Gauss–FFT method on VGG and AlexNet (one panel per layer; x axis: compute–to–memory ratio).


Table 8. AIs of Gauss–FFT transforms of a single input tile (In), kernel (Ker), or output tile (Out). Each row below lists, for its method G(m², r²), the In/Ker/Out triples for r = 2, 3, 4, 5, 6, and 7, in that order; values are given in hundredths (e.g. 058 denotes 0.58).

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -


Figure 6. Running time of our implementations compared to state–of–the–art on AVX512–capable systems (panels: Xeon Phi 7210 at 409.6 GB/s, CMR of 11; i9-7900X at 96 GB/s, CMR of 22; i9-7900X at 68.3 GB/s, CMR of 31; bars: Our Winograd, Our Regular–FFT, Our Gauss–FFT, MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd; one plot per VGG and AlexNet layer; times in milliseconds).

16

283

83

155

28216

12124

36

906

9

146

12

0

100

200

300

VGG 12

102

92

694

5

829

5

519

2383

1

568

7

0

30

60

90

VGG 21

178

06

138

54

104

85683

1

528

2

745

0

0

50

100

150

VGG 22

102

36

442

4

372

2

293

4

249

6

291

5

0

30

60

90

VGG 31

121

72

875

5

675

7446

6

445

5

419

1

0

50

100

VGG 32

933

3

420

0

272

2

248

6

241

8

193

8

0

25

50

75

100VGG 41

145

85

833

4509

5

452

8

444

4

354

0

0

50

100

150

VGG 42

647

3

232

8

185

5

149

1

140

7

124

4

0

20

40

60

VGG 5152

na19

95

na

122

4

852

359

8

0

10

20

30

AlexNet 2

226

6

116

1

130

2873

717815

0

5

10

15

20

AlexNet 3

395

5

162

2

193

6

116

3

106

5

106

7

0

10

20

30

40

AlexNet 4

246

2

108

7

129

0779

725

729

0

10

20

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 (using 48 cores) 1024 GBs CMR of 33T

ime

(mill

isec

onds

)

297

93

146

66

206

10126

13

944

4

146

57

0

100

200

300

VGG 12

116

19

638

2

718

8526

0

393

8

575

9

0

25

50

75

100

125VGG 21

215

60

129

62

107

95711

7

524

6

770

6

0

50

100

150

200

VGG 22

105

16

559

3391

3

308

3

240

7

294

5

0

30

60

90

VGG 31

106

90

117

46

976

6

438

2

425

6

398

1

0

40

80

120

VGG 32

104

13

334

4

435

3

226

1

209

6

174

0

0

30

60

90

VGG 41

142

18

667

2

740

1415

0

388

2

308

6

0

50

100

150

VGG 42

798

9

186

1

167

3

139

2

124

0

109

3

0

20

40

60

80

VGG 5152n

a

156

1

na

122

7

865

363

6

0

10

20

30

40AlexNet 2

287

0

95214

12

852

661800

0

10

20

30

AlexNet 3

333

1

130

4

229

1

106

1

957

927

0

10

20

30

AlexNet 4

299

9

87612

33

756

670

689

0

10

20

30

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 1024 GBs CMR of 3911

Tim

e (m

illis

econ

ds)

403

59

361

04

291

20

254

04182

45

292

40

0

100

200

300

400

VGG 12

114

93

136

57

121

77

103

81

766

8

113

60

0

50

100

150VGG 21

187

07281

99

238

27

132

57

106

41

148

55

0

100

200

300VGG 22

638

0

142

72

949

7

592

7

493

5

590

1

0

50

100

150

VGG 31

105

34

263

79

188

49

856

3

830

4

820

4

0

100

200

VGG 32

491

5

134

24

896

6

484

0

458

3

379

9

0

50

100

VGG 41

887

1

248

79

170

67865

1

849

7

695

1

0

100

200

VGG 42

343

9

713

3

444

4

261

4

263

0

250

7

0

20

40

60

VGG 5152

na

186

41

na24

94

176

1685

8

0

50

100

150

200AlexNet 2

188

0700

4

640

6

160

5

133

8

163

3

0

20

40

60

AlexNet 326

32

102

98

658

6

207

5

183

8

201

5

0

30

60

90

AlexNet 4

200

6691

4

438

8

148

7

124

2

144

3

0

20

40

60

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 512 GBs CMR of 4125

Tim

e (m

illis

econ

ds)

Figure 7 Running time of our implementations compared to statendashofndashthendashart on AVX512 capable systems continued

17

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 Winogradndash and FFTndash Based Convolutions
    • 22 Winogradndash and FFTndash Based Convolution Layers
    • 23 Gauss Multiplication of Complex Numbers
      • 3 Implementations
      • 4 Performance Comparisons
      • 5 Performance Analysis
        • 51 Relative Performances
        • 52 The Accuracy of the Theoretical Estimates
        • 53 Discussions
          • 6 Conclusions
          • References
          • A Detailed Analysis
            • A1 Input Transform Stage
            • A2 Kernel Transform Stage
            • A3 Elementndashwise Products
            • A4 Output Transform
              • B Model Lookup Tables
              • C Additional Model Verification
              • D Comparing to Statendashofndashthendashart
Page 6: FFT Convolutions are Faster than Winograd on Modern CPUs ... · Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One

such caches are very limited and cannot fit the working set of a typical convolutional layer.

Additional performance ceilings for lower levels of cache are not necessary since, in both Winograd–based and FFT–based convolutional algorithms, the computation is either bounded by transfers to and from main memory or by the processor's peak FLOPS, as shown in the following sections.

As the performance ceiling is set by the algorithm's arithmetic intensity (Eqn 7), its running time can be estimated by

$$\text{running time} = \frac{FPO}{\text{Attainable FLOPS}} = \frac{FPO}{\min(\text{Peak FLOPS},\ MB \times AI)} = \frac{FPO/MB}{\min(CMR,\ AI)} = \begin{cases} \dfrac{FPO}{\text{Peak FLOPS}}, & CMR \le AI \\[4pt] \dfrac{DM}{MB}, & CMR > AI \end{cases} \qquad (8)$$

Here CMR is the system's compute–to–memory ratio, defined as the ratio of its Peak FLOPS to MB, the memory bandwidth. FPO represents the total number of floating point operations required, and DM (the total amount of data movement, in bytes) is as defined above. The running time is compute bound when CMR ≤ AI, in which case it can be computed as FPO/Peak FLOPS; otherwise the running time is memory bound and can be computed as DM/MB.

Estimating running time. For the Winograd– and FFT–based convolutional layers, where the computation is composed of several sequential stages S, each with a unique arithmetic intensity AI_s, the running time is calculated by accumulating the running times of all stages s ∈ S:

$$\text{running time}_{Total} = \sum_{s \in S} \text{running time}_s = \sum_{s \in S} \frac{FPO_s / MB}{\min(CMR,\ AI_s)} \qquad (9)$$
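As an illustration of Eqns 8 and 9, the following minimal Python sketch estimates the total running time of a multi–stage algorithm from per–stage FLOP counts and data movements; the Stage container and function names are ours, not part of the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    fpo: float  # floating point operations required by the stage
    dm: float   # bytes moved between main memory and cache

def stage_time(stage, peak_flops, mem_bw):
    """Eqn 8: a stage is either compute bound or memory bound."""
    ai = stage.fpo / stage.dm   # arithmetic intensity (FLOPs per byte)
    cmr = peak_flops / mem_bw   # compute-to-memory ratio of the system
    if cmr <= ai:               # compute bound
        return stage.fpo / peak_flops
    return stage.dm / mem_bw    # memory bound

def total_time(stages, peak_flops, mem_bw):
    """Eqn 9: accumulate the running times of all sequential stages."""
    return sum(stage_time(s, peak_flops, mem_bw) for s in stages)
```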

5.1 Relative Performances

We are interested in the relative performance between the Winograd and FFT methods. We define Speedup(A, B) as the ratio of the running times of an algorithm B and an algorithm A:

$$\text{Speedup}(A, B) = \frac{\text{running time}^{B}_{Total}}{\text{running time}^{A}_{Total}} \qquad (10)$$
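A direct transcription of Eqn 10, reusing the hypothetical total_time helper from the sketch above:

```python
def speedup(stages_a, stages_b, peak_flops, mem_bw):
    """Speedup(A, B) > 1 means algorithm A is faster (Eqn 10)."""
    return (total_time(stages_b, peak_flops, mem_bw) /
            total_time(stages_a, peak_flops, mem_bw))
```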

A speedup greater than one indicates that algorithm A is faster, while a speedup smaller than one means that algorithm B is faster. Eqn 9 estimates the running time of an algorithm assuming perfect utilization of the hardware, which is rarely possible in practice. However, Eqn 10 remains valid when the compared algorithms are equally optimized, meaning that they utilize the same percentage of the hardware's capabilities.

In Eqn 10 the value of AI differs between the Winograd– and FFT–based approaches, and also depends on the amount of available cache [18]. Therefore, when comparing the performance of two algorithms, the relative performance on a given system depends on the CMR ratio and the amount of available cache, but not on the absolute values of the system's compute speed or memory throughput.

A detailed analysis of how the values of AI, DM and FPO are obtained is presented in Appendix A.

Using Eqn 10 and the values of AI, DM and FPO calculated in Appendix A (Tbl 2), we estimate the speedup of the FFT–based methods over the Winograd–based one. For all three methods, the tile sizes are chosen so as to minimize the total running time (Eqn 9).

5.2 The Accuracy of the Theoretical Estimates

In Fig. 3 we plot the estimated theoretical speedup of the two FFT–based methods over the Winograd–based one. The solid lines represent the theoretical estimates of the relative performances as a function of the system's CMR. The color of each line represents the amount of available L2 cache. The lines are drawn in a semi–translucent fashion as they overlap in many places.

The empirical results from the benchmarks described in Section 4 are overlaid; each cross–hair represents the measurement of the relative performance on a single system. The x coordinate of a cross–hair represents the system's compute–to–memory (CMR) ratio (see Tbl 1), and the color represents its L2 cache size.

Our empirical results are consistent with our theoretical estimates. The overall relative root mean square error (rRMSE) was 0.079 for Regular–FFT vs Winograd and 0.1 for Gauss–FFT vs Winograd. The fitness⁴ was 92.68 for Regular–FFT vs Winograd and 90 for Gauss–FFT vs Winograd.
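For reference, a small sketch of how such error metrics can be computed from paired theoretical and empirical speedups. The fitness definition follows footnote 4; the rRMSE shown here uses one common definition (per–point error relative to the measurement), which is an assumption on our part, and the input arrays are placeholders.

```python
import numpy as np

def rrmse(predicted, measured):
    """Relative root mean square error (assumed per-point relative form)."""
    predicted, measured = np.asarray(predicted), np.asarray(measured)
    return float(np.sqrt(np.mean(((predicted - measured) / measured) ** 2)))

def fitness(predicted, measured):
    """Footnote 4: fitness = 100 / (1 + rRMSE)."""
    return 100.0 / (1.0 + rrmse(predicted, measured))
```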

5.3 Discussions

Hardware utilization. We have also measured the system utilization of each stage of the three methods. While the utilization varied across benchmarked layers and systems, on average 75% of the theoretical peak FLOPS were attained during the compute bound stages, and slightly more than 85% of the theoretical memory bandwidth was achieved in the memory bound stages. This results in a somewhat lower effective CMR, which is consistent with the results shown in Fig. 3: the empirical results are slightly shifted to the left (towards lower values of CMR) when compared to the theoretical predictions, which assume equal utilization of both the FLOPS and the memory bandwidth.

Optimality of Tile Transforms. Both "wincnn" [7] and "genfft" [13] use heuristics to minimize the number of operations for performing the transforms, and might not be optimal. However, as their AIs are much smaller than the CMRs of modern systems, the computation is memory bound, and the running time of the transform stages depends solely on the amount of data movement. The largest AI of the FFT transforms is 5.55, and of the Winograd transforms 2.38, much lower than the CMRs of modern CPUs. The Xeon Phi Knights Landing processor has a CMR of 11 (due to on–chip fast MCDRAM),

⁴fitness = 100 / (1 + rRMSE)

and the Xeon Server processor family has CMRs in the range of 20–40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were provided.

[Figure 3: Theoretical estimates and empirical measurements for the speedup of the Regular– and Gauss–FFT methods over the Winograd method on VGG and AlexNet. The x axis represents the system's compute–to–memory ratio (FLOPs/Byte). One panel per layer (AlexNet 2–5, VGG 1.2–5.1/5.2), shown for Regular–FFT vs Winograd and for Gauss–FFT vs Winograd; solid lines are theoretical estimates, cross–hairs are empirical measurements, and color indicates the L2 cache size (256 KB, 512 KB, 1024 KB).]

This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. Using such sizes, the images could be divided into overlapping tiles with minimal overhead (minimal padding of the original image), which reduces both the number of required operations and the amount of required data movement.

6 Conclusions

This paper presents experimental and analytical findings that FFT–based convolutions are, on average, faster than Winograd–based convolutions on modern CPUs.

In contrast to the popular belief that the Winograd–based method is more efficient, our experimental results of a highly optimized Winograd–based implementation and two similarly optimized FFT–based implementations on modern CPUs show that the FFT convolutions are more often than not faster than Winograd ones for the most commonly used ConvNet layers.

Our analysis using the Roofline model shows that whether the Winograd– or FFT–based approach is faster depends on both the layer and the target hardware. However, with the tendency of increasing compute–to–memory ratios of systems with modern CPUs, the FFT–based methods tend to be faster than Winograd.

While we primarily focused on modern multi– and many–core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can be applied to GPUs. Future work might include implementing and benchmarking efficient GPU–based implementations and validating the performance analysis based on the Roofline model.

References

[1] 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library (2016).
[2] 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn (2016).
[3] 2018. N-Dimensional Winograd–based convolution framework. https://bitbucket.org/poozh/ond-winograd (2018).
[4] Accessed 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks (Accessed 08-15-2018).
[5] Accessed 08-15-2018. Intel® Nervana reference deep learning framework. https://github.com/NervanaSystems/neon (Accessed 08-15-2018).
[6] Accessed 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm (Accessed 08-15-2018).
[7] Accessed 08-15-2018. Wincnn. https://github.com/andravin/wincnn (Accessed 08-15-2018).
[8] Accessed 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master (Accessed 08-15-2018).
[9] Jeremy Appleyard. Accessed 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5 (Accessed 08-15-2018).
[10] David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[13] Matteo Frigo and Steven G Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, Vol. 3. IEEE, 1381–1384.
[14] Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 229–254. http://dl.acm.org/citation.cfm?id=647970.761024
[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 981–991.
[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[19] Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455–500.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
[23] M Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306–308.
[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851 (2013).
[25] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in neural information processing systems. 2924–2932.
[26] Victor Y Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676–694.
[27] Lawrence R Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Englewood Cliffs, NJ, Prentice-Hall, Inc. (1975).
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24–26.
[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. Siam.
[35] Wm A Wulf and Sally A McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news 23, 1 (1995), 20–24.
[36] Aleksandar Zlateski, Kisuk Lee, and H Sebastian Seung. 2016. ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium, 2016 IEEE International. IEEE, 801–811.
[37] Aleksandar Zlateski, Kisuk Lee, and H Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 854–865.
[38] Aleksandar Zlateski and H Sebastian Seung. 2017. Compile-time optimized and statically scheduled ND convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity, we assume 2D and isotropic images and kernels, as well as computation performed with 32–bit floating point numbers (4 bytes per number). Extension to non–isotropic and N–dimensional images and kernels, as well as lower precision floats, is straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non–isotropic kernels.

We follow the same notation as in Section 2 and, for an arbitrary layer, use Winograd convolution F(m², r²), Regular–FFT convolution F(m², r²), and Gauss–FFT convolution G(m², r²). Here m can take an arbitrary value, whereas r² has to equal the size of the kernels in the layer.

We proceed to present the details of all three methods and estimate the total number of operations and data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A.1 Input Transform Stage

In Winograd and both FFT approaches, each of the B · C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has a size of t², with t = m + r − 1, and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions. Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image is implicitly zero–padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.
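A minimal sketch of the tiling arithmetic above (function names are ours):

```python
import math

def num_tiles_per_image(x, m, r):
    """Number of overlapping t-by-t tiles (t = m + r - 1) per x-by-x image."""
    return math.ceil((x - r + 1) / m) ** 2

def total_input_tiles(batch, channels, x, m, r):
    """B * C * N image tiles are transformed in the input stage."""
    return batch * channels * num_tiles_per_image(x, m, r)
```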

Number of required operations. Each tile I_bcn is transformed via Î_bcn = B I_bcn Bᵀ, which is a composition of two matrix–matrix multiplications. Here I_bcn is the n–th tile of the c–th channel in batch b, and Î_bcn is its transform. Both B and I_bcn have size t², requiring a total of 4t³ operations (2t³ per matrix–matrix multiplication). Operations are on real numbers for the Winograd approach and on complex numbers for the FFT approaches.

The transforms can, however, be performed with much fewer operations. In the Winograd method the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real–to–complex DFT transforms are performed, which require much fewer operations (μn² log n instead of O(n³)). Here the constant μ can vary significantly based on the size of the transform performed [13]. It takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.

Instead of using a closed–form expression that gives bounds on the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as F_I(m², r²) for Winograd, F_I(m², r²) for Regular–FFT, and G_I(m², r²) for Gauss–FFT. The pre–computed lookup tables are included in our open–sourced repository.

The total number of operations required for performing all the transforms for all three methods is given in Tbl 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in–cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx² · 4 bytes: each of the B · C images of size x² is read only once. Each of the BCN transformed tiles is stored back to main memory. The size of each transformed tile depends on the method.

In Winograd convolution, transformed tiles are real tensors and require a single 32–bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms are complex and conjugate–symmetric (along one of the dimensions). Only t⌈(t + 1)/2⌉ complex numbers need to be stored. The Regular–FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and the Gauss–FFT requires three reals per complex number, for a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl 2.
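The data–movement expressions above (and the corresponding AI rows of Tbl 2) translate directly into code. This sketch assumes 4–byte floats and takes the per–tile transform FLOP count as an input (in the paper it comes from the lookup tables in Appendix B); the function names are ours.

```python
import math

def input_stage_dm(method, B, C, x, m, r):
    """Bytes moved by the input transform stage (DM row of Tbl 2)."""
    t = m + r - 1
    N = math.ceil((x - r + 1) / m) ** 2
    read = 4 * B * C * x * x                      # every input image read once
    half = t * math.ceil((t + 1) / 2)             # stored complex entries per FFT tile
    per_tile = {"winograd": 4 * t * t,            # real tensor, one float per element
                "regular_fft": 8 * half,          # two floats per complex value
                "gauss_fft": 12 * half}[method]   # three floats per complex value
    return read + B * C * N * per_tile

def input_stage_ai(method, flops_per_tile, B, C, x, m, r):
    """AI = FLOPS / DM for the input transform stage."""
    N = math.ceil((x - r + 1) / m) ** 2
    return (B * C * N * flops_per_tile) / input_stage_dm(method, B, C, x, m, r)
```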

A.2 Kernel Transform Stage

In all methods, each of the C × C′ kernels (with size r × r) is transformed via Ŵ_cc′ = G W_cc′ Gᵀ. The matrix G has size t × r.

Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero–padded to match the size of the input tiles. In practice, the transforms are computed with implicit zero–padding. As in the input stage, we have pre–computed the number of operations required for transforming a kernel in all three methods: F_K(m², r²) for Winograd and F_K(m², r²) for Regular–FFT. For the Gauss–FFT method, G_K(m², r²) = F_K(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.

Table 2: FLOPS, DM and AI for different stages of the Winograd (F(m², r²)), Regular–FFT (F(m², r²)) and Gauss–FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn 13.

FLOPS
  Input image transform:  Winograd BCN F_I(m², r²) | Regular–FFT BCN F_I(m², r²) | Gauss–FFT BCN G_I(m², r²)
  Kernel transform:       Winograd CC′ F_K(m², r²) | Regular–FFT CC′ F_K(m², r²) | Gauss–FFT CC′ G_K(m², r²)
  Element–wise product:   Winograd 2t²BNCC′ | Regular–FFT 8t⌈(t+1)/2⌉BNCC′ | Gauss–FFT 6t⌈(t+1)/2⌉BNCC′
  Output transform:       Winograd BC′N F_O(m², r²) | Regular–FFT BC′N F_O(m², r²) | Gauss–FFT BC′N G_O(m², r²)

DM
  Input image transform:  Winograd 4BCx² + 4BCNt² | Regular–FFT 4BCx² + 8BCNt⌈(t+1)/2⌉ | Gauss–FFT 4BCx² + 12BCNt⌈(t+1)/2⌉
  Kernel transform:       Winograd 4CC′(r² + t²) | Regular–FFT 4CC′(r² + 2t⌈(t+1)/2⌉) | Gauss–FFT 4CC′(r² + 3t⌈(t+1)/2⌉)
  Element–wise product:   Winograd 4t²BN(c + αc′)CC′/(cc′) | Regular–FFT 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′) | Gauss–FFT 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
  Output transform:       Winograd 4BC′N(t² + m²) | Regular–FFT 4BC′N(2t⌈(t+1)/2⌉ + m²) | Gauss–FFT 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
  Input image transform:  Winograd N F_I(m², r²) / (4x² + 4Nt²) | Regular–FFT N F_I(m², r²) / (4x² + 8Nt⌈(t+1)/2⌉) | Gauss–FFT N G_I(m², r²) / (4x² + 12Nt⌈(t+1)/2⌉)
  Kernel transform:       Winograd F_K(m², r²) / (4r² + 4t²) | Regular–FFT F_K(m², r²) / (4r² + 8t⌈(t+1)/2⌉) | Gauss–FFT G_K(m², r²) / (4r² + 12t⌈(t+1)/2⌉)
  Element–wise product:   Winograd cc′ / (2(c + αc′)) | Regular–FFT cc′ / (c + αc′) | Gauss–FFT cc′ / (2(c + αc′))
  Output transform:       Winograd F_O(m², r²) / (4t² + 4m²) | Regular–FFT F_O(m², r²) / (8t⌈(t+1)/2⌉ + 4m²) | Gauss–FFT G_O(m², r²) / (12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory, reading a total of 4CC′r² bytes, and store the CC′ transformed kernels back to main memory. The transformed kernels have size t × t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉ and 12t⌈(t + 1)/2⌉ bytes per kernel for the Winograd, Regular–FFT and Gauss–FFT methods, respectively.

The total number of operations, data movements and AIs for transforming all kernels of a particular layer, using any of the three methods, are given in Tbl 2.

A.3 Element–wise Products

Having all transformed input and kernel tiles, the pre–transformed output tiles Î′_bc′n are computed via

$$\hat{I}'_{b,c',n} = \sum_{c=1}^{C} \hat{I}_{b,c,n} \odot \hat{W}_{c,c'} \qquad (11)$$

Here all of the pre–transformed output tiles Î′_bc′n, transformed input tiles Î_bcn, and transformed kernels Ŵ_cc′ have the same size t × t. Computing an element of the pre–transformed output tile at a location ω depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

$$\hat{I}'_{b,c',n}(\omega) = \sum_{c=1}^{C} \hat{I}_{b,c,n}(\omega) \cdot \hat{W}_{c,c'}(\omega) \qquad (12)$$

Note that the equation above can be interpreted as the multiplication of a matrix U_ω of size BN × C with a matrix V_ω of size C × C′, resulting in a matrix X_ω of size BN × C′. For layers of modern ConvNets, values of BN are much larger than C, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location ω of Î′_bc′n.
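In code, Eqn 12 is a batch of independent matrix multiplications, one per transform–domain location. A minimal NumPy sketch (dense, without the cache blocking discussed below), assuming the transformed tiles are laid out as arrays indexed by location:

```python
import numpy as np

def elementwise_stage(inputs, kernels):
    """
    inputs : array of shape (T, B*N, C)      -- T transform-domain locations
    kernels: array of shape (T, C, Cprime)
    returns: pre-transformed output tiles, shape (T, B*N, Cprime)

    Each location is an independent (BN x C) @ (C x C') product (Eqn 12);
    the arrays are complex for the FFT methods and real for Winograd.
    """
    return np.einsum('tbc,tck->tbk', inputs, kernels)
```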

A.3.1 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre–transformed output tiles need to be computed: as both the transformed input tiles and transformed kernels are conjugate symmetric, the pre–transformed output tiles will also be conjugate symmetric.

In the Regular–FFT method, 4 real multiply–adds are required for performing a complex multiply–add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer. Gauss–FFT reduces each complex matrix multiplication to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer.
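The per–method FLOP counts above can be tabulated with a small helper (a sketch; the naming is ours):

```python
import math

def elementwise_flops(method, B, N, C, Cprime, m, r):
    """Total FLOPs of the element-wise stage for the whole layer."""
    t = m + r - 1
    half = t * math.ceil((t + 1) / 2)   # conjugate-symmetric locations kept
    if method == "winograd":
        return 2 * t * t * B * N * C * Cprime   # real multiply-adds at t^2 locations
    if method == "regular_fft":
        return 8 * half * B * N * C * Cprime    # 4 real multiply-adds per complex one
    if method == "gauss_fft":
        return 6 * half * B * N * C * Cprime    # 3 real multiplications via Gauss's trick
    raise ValueError(method)
```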

A.3.2 Data Movement

In the Winograd method, t² real matrix multiplications X = U × V are performed (for the rest of the section we omit the subscript ω for clarity). In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications of the same sizes are performed. The Gauss–FFT method replaces each complex matrix multiplication with three real matrix multiplications; thus a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U and V may not fit in cache, in which case standard cache–blocking approaches are employed [14–16, 18, 22], which might require some data to be moved to and/or from the main memory multiple times.

In the standard cache–blocking technique, V (of size C × C′) is subdivided into equally sized sub–matrices of size c × c′. To minimize transfers to and from main memory, a sub–matrix of V is kept in cache. A small sub–matrix of U, with size ρ × c, is fetched from memory, multiplied with the sub–matrix of V stored in–cache, and accumulated to the appropriate sub–matrix of X. Here ρ is a small number required for efficient in–cache computation [14–16, 18, 22].

This requires transferring ρc numbers of U from main memory, and ρc′ numbers of X from, and then back to, main memory, for a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub–matrix multiplication produces the final result (no accumulation is necessary).

A total of BNCC′/(ρcc′) such sub–matrix multiplications are performed, requiring BNCC′(c + αc′)/(cc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications (X = U × V) are performed, thus a total of 4t²BN(c + αc′)CC′/(cc′) bytes need to be moved.

In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

The Gauss–FFT method replaces each of the t⌈(t + 1)/2⌉ complex matrix multiplications with 3 real matrix multiplications; a total of 12t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes need to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and c′ while allowing for in–cache computation.

As the values of t, B, C, C′ and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

$$\begin{aligned}
\underset{c,\ c'}{\text{minimize}} \quad & \frac{c + \alpha c'}{c\,c'} \\
\text{subject to} \quad & c \mid C \quad (C \text{ is divisible by } c) \\
& c' \mid C' \quad (C' \text{ is divisible by } c') \\
& 4\beta c c' \le \frac{\text{Cache Size}}{2} \quad (\text{fits in half the cache})
\end{aligned} \qquad (13)$$

Here α equals 1 when c = C and 2 when c < C; β is 1 for real valued matrices and 2 for complex valued ones. Half the cache is allotted to the sub–matrices of V; this is typical practice to ensure enough space for the sub–matrices of U and X.

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for the optimal values of c and c′. This results in AIs of cc′/(2(c + αc′)) for the Winograd and Gauss–FFT methods, and cc′/(c + αc′) for the Regular–FFT method, which are equal to the AIs of real matrix and complex matrix multiplications, respectively [14–16, 18].
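A brute–force solver for the blocking problem in Eqn 13 (a sketch under the stated assumptions; a real implementation would also account for the ρ parameter and register blocking):

```python
def best_blocking(C, Cprime, cache_bytes, complex_matrices):
    """Pick c | C and c' | C' minimizing (c + alpha*c')/(c*c'), Eqn 13."""
    beta = 2 if complex_matrices else 1          # reals per matrix element
    best = None
    for c in (d for d in range(1, C + 1) if C % d == 0):
        alpha = 1 if c == C else 2               # no re-accumulation when c == C
        for cp in (d for d in range(1, Cprime + 1) if Cprime % d == 0):
            if 4 * beta * c * cp > cache_bytes // 2:   # half the cache for V's block
                continue
            cost = (c + alpha * cp) / (c * cp)
            if best is None or cost < best[0]:
                best = (cost, c, cp)
    cost, c, cp = best
    ai = 1.0 / cost if complex_matrices else 1.0 / (2.0 * cost)
    return c, cp, ai   # block sizes and the resulting arithmetic intensity
```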

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels, as a function of the cache size.

[Figure 4: Arithmetic intensities for the element–wise stage given various amounts of cache (x axis: cache size in KB; one curve per channel count C and C′ of 64, 128, 256, 512 and 1024), shown separately for real matrices (as multiplied in the Winograd and Gauss–FFT methods) and complex matrices (as multiplied in the Regular–FFT method).]

The AIs of both complex matrix multiplication (used in the Regular–FFT method) and real matrix multiplication (used in Winograd and Gauss–FFT) increase with the cache size and with the number of channels (C and C′). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element–wise stage of the Regular–FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss–FFT convolution when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the BNC′ pre–transformed output tiles Î′_bc′n is transformed from the Winograd/FFT domain back to the spatial domain via I′_bc′n = Aᵀ Î′_bc′n A. The matrix A has size t × m (t = m + r − 1), resulting in I′_bc′n having a size of m × m. Here I′_bc′n is the n–th (non–overlapping) tile of the c′–th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m × m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements and AIs for transforming the output tiles of a particular layer, using any of the three methods, are given in Tbl 2.

B Model Lookup Tables

In Tbl 3 and Tbl 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl 5 and Tbl 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular–FFT method.

In Tbl 7 and Tbl 8 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss–FFT method.

Table 3: FLOPS required for performing Winograd transforms of a single tile or kernel. For each kernel size r = 2…7, the columns give the FLOPS of the input (In), kernel (Ker) and output (Out) transforms.

Method    | r=2: In Ker Out | r=3: In Ker Out | r=4: In Ker Out | r=5: In Ker Out | r=6: In Ker Out | r=7: In Ker Out
F(2², r²) | 12 5 5          | 32 49 42        | 180 198 154     | 240 231 168     | 742 728 504     | 704 645 430
F(3², r²) | 32 24 28        | 180 128 128     | 240 170 153     | 742 552 460     | 704 518 407     | - - -
F(4², r²) | 180 70 90       | 240 135 150     | 742 396 396     | 704 403 372     | - - -           | - - -
F(5², r²) | 240 72 99       | 742 260 312     | 704 300 325     | - - -           | - - -           | - - -
F(6², r²) | 742 144 208     | 704 242 308     | - - -           | - - -           | - - -           | - - -
F(7², r²) | 704 130 195     | - - -           | - - -           | - - -           | - - -           | - - -

Table 4: AIs of the Winograd transforms for a single tile or kernel (same layout as Tbl 3).

Method    | r=2: In Ker Out  | r=3: In Ker Out  | r=4: In Ker Out  | r=5: In Ker Out  | r=6: In Ker Out  | r=7: In Ker Out
F(2², r²) | 0.17 0.10 0.10   | 0.25 0.49 0.53   | 0.90 1.21 1.33   | 0.83 0.95 1.05   | 1.89 2.14 2.38   | 1.38 1.43 1.58
F(3², r²) | 0.25 0.30 0.28   | 0.90 0.94 0.94   | 0.83 0.82 0.85   | 1.89 1.86 1.98   | 1.38 1.29 1.39   | - - -
F(4², r²) | 0.90 0.60 0.55   | 0.83 0.75 0.72   | 1.89 1.52 1.52   | 1.38 1.13 1.16   | - - -            | - - -
F(5², r²) | 0.83 0.45 0.41   | 1.89 1.12 1.05   | 1.38 0.94 0.91   | - - -            | - - -            | - - -
F(6², r²) | 1.89 0.68 0.61   | 1.38 0.83 0.77   | - - -            | - - -            | - - -            | - - -
F(7², r²) | 1.38 0.48 0.43   | - - -            | - - -            | - - -            | - - -            | - - -

Table 5: FLOPS required for performing Regular–FFT transforms of a single tile or kernel. Rows F(m², r²) for m = 2…31; for each kernel size r = 2…7, the columns give the FLOPS of the input (In), kernel (Ker) and output (Out) transforms.

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -


Table 6: AIs of the Regular–FFT transforms for a single tile or kernel. Rows F(m², r²) for m = 2…31; for each kernel size r = 2…7, the columns give the AIs of the input (In), kernel (Ker) and output (Out) transforms.

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss–FFT vs the Regular–FFT method.

D Comparing to State–of–the–art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and with other state–of–the–art libraries. The benchmarks were performed on AVX512 capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.


Table 7: FLOPS required for performing Gauss–FFT transforms of a single tile or kernel. Rows G(m², r²) for m = 2…31; for each kernel size r = 2…7, the columns give the FLOPS of the input (In), kernel (Ker) and output (Out) transforms.

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

[Figure 5: Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs the Gauss–FFT method on VGG and AlexNet (one panel per layer, labelled Gauss–FFT vs Regular–FFT; x axis: the system's compute–to–memory ratio; y axis: speedup).]


Table 8: AIs of the Gauss–FFT transforms for a single tile or kernel. Rows G(m², r²) for m = 2…31; for each kernel size r = 2…7, the columns give the AIs of the input (In), kernel (Ker) and output (Out) transforms.

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

[Figure 6: Running time (milliseconds) of our implementations compared to state–of–the–art on AVX512 capable systems. One group of panels per system: Xeon Phi 7210 (409.6 GB/s, CMR of 11), i9-7900X (96 GB/s, CMR of 22) and i9-7900X (68.3 GB/s, CMR of 31). Each panel shows, for a unique VGG or AlexNet layer, the time of our Winograd (Jia et al.), Regular–FFT and Gauss–FFT implementations, MKL-DNN Winograd, MKL-DNN Direct and LIBXSMM Winograd.]

[Figure 7: Running time (milliseconds) of our implementations compared to state–of–the–art on AVX512 capable systems, continued. One group of panels per system: Xeon Phi 7210 using 48 cores (102.4 GB/s, CMR of 33), Xeon Phi 7210 (102.4 GB/s, CMR of 39.11) and i9-7900X (51.2 GB/s, CMR of 41.25). Each panel shows, for a unique VGG or AlexNet layer, the time of our Winograd, Regular–FFT and Gauss–FFT implementations, MKL-DNN Winograd, MKL-DNN Direct and LIBXSMM Winograd.]

17

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 Winogradndash and FFTndash Based Convolutions
    • 22 Winogradndash and FFTndash Based Convolution Layers
    • 23 Gauss Multiplication of Complex Numbers
      • 3 Implementations
      • 4 Performance Comparisons
      • 5 Performance Analysis
        • 51 Relative Performances
        • 52 The Accuracy of the Theoretical Estimates
        • 53 Discussions
          • 6 Conclusions
          • References
          • A Detailed Analysis
            • A1 Input Transform Stage
            • A2 Kernel Transform Stage
            • A3 Elementndashwise Products
            • A4 Output Transform
              • B Model Lookup Tables
              • C Additional Model Verification
              • D Comparing to Statendashofndashthendashart
Page 7: FFT Convolutions are Faster than Winograd on Modern CPUs ... · Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One

[Figure 3: Theoretical estimates and empirical measurements for the speedup of the Regular– and Gauss–FFT methods over the Winograd method on VGG and AlexNet. One panel per layer (AlexNet 2–5 and VGG 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2, 5.1/5.2), shown separately for Regular–FFT vs Winograd and Gauss–FFT vs Winograd. The x axis is the system's compute–to–memory ratio (FLOPs/Byte), the y axis is the speedup; curves compare empirical and theoretical values for L2 cache sizes of 256 KB, 512 KB, and 1024 KB.]

and the Xeon Server processor family has CMRs in the range of 20–40. Hence, our theoretical analysis yields identical estimates as if the optimal number of operations were provided.

This is consistent with our experimental findings, where in some cases tile sizes that were large primes, such as 31, were optimal. Using such sizes, the images could be divided into overlapping tiles with minimal overhead (minimal padding of the original image), which reduces both the number of required operations and the amount of required data movement.
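To make this padding trade–off concrete, the following sketch (illustrative only; it is not part of the benchmarked implementations, and the layer sizes are just examples) computes how many tiles and how much implicit zero–padding a candidate transform size t implies, using the tiling rule t = m + r − 1 and N = ⌈(x − r + 1)/m⌉² from Appendix A1.

```python
# Illustrative sketch: tiles and implicit zero-padding implied by a
# transform-tile size t for an x-by-x image with r-by-r kernels.
from math import ceil

def tiling_cost(x, r, t):
    m = t - r + 1                    # output tile size
    n1d = ceil((x - r + 1) / m)      # overlapping tiles along one dimension
    covered = n1d * m + r - 1        # image extent actually covered by the tiles
    return n1d * n1d, covered - x    # (tiles per image, padded pixels per dimension)

# Example: 224x224 images with 3x3 kernels, comparing a few tile sizes.
for t in (16, 27, 30, 31, 32):
    print(t, tiling_cost(224, 3, t))
```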

6 Conclusions

This paper presents experimental and analytical findings that FFT–based convolutions are, on average, faster than Winograd–based convolutions on modern CPUs.

In contrast to the popular belief that the Winograd–based method is more efficient, our experimental results of a highly optimized Winograd–based implementation and two similarly optimized FFT–based implementations on modern CPUs show that the FFT convolutions are, more often than not, faster than Winograd ones for most commonly used ConvNet layers.

Our analysis using the Roofline model shows that whether the Winograd– or FFT–based approach is faster depends on both the layer and the target hardware. However, as the compute–to–memory ratios of systems with modern CPUs continue to increase, the FFT–based methods tend to be faster than Winograd.

While we primarily focused on modern multi– and many–core CPUs, which generally have larger caches but smaller memory bandwidths than modern GPUs, we believe that our performance analysis can be applied to GPUs. Future work might include implementing and benchmarking efficient GPU–based implementations and validating the performance analysis based on the Roofline model.

References

[1] 2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https://colfaxresearch.com/falcon-library (2016).
[2] 2016. Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn (2016).
[3] 2018. N-Dimensional Winograd–based convolution framework. https://bitbucket.org/poozh/ond-winograd (2018).
[4] Accessed 08-15-2018. Imagenet Winners Benchmarking. https://github.com/soumith/convnet-benchmarks (Accessed 08-15-2018).
[5] Accessed 08-15-2018. Intel® Nervana reference deep learning framework. https://github.com/NervanaSystems/neon (Accessed 08-15-2018).
[6] Accessed 08-15-2018. LIBXSMM. https://github.com/hfp/libxsmm (Accessed 08-15-2018).
[7] Accessed 08-15-2018. Wincnn. https://github.com/andravin/wincnn (Accessed 08-15-2018).
[8] Accessed 08-15-2018. Winograd and FFT Convolution. https://bitbucket.org/jiazhentim/winograd-fft-conv/src/master (Accessed 08-15-2018).
[9] Jeremy Appleyard. Accessed 08-15-2018. Optimizing Recurrent Neural Networks in cuDNN 5. https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5 (Accessed 08-15-2018).
[10] David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, and Nir Shavit. 2016. Deep Tensor Convolution on Multicores. arXiv preprint arXiv:1611.06565 (2016).
[11] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[13] Matteo Frigo and Steven G. Johnson. 1998. FFTW: An adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, Vol. 3. IEEE, 1381–1384.
[14] Dennis Gannon, William Jalby, and Kyle Gallivan. 1988. Strategies for Cache and Local Memory Management by Global Program Transformation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 229–254. http://dl.acm.org/citation.cfm?id=647970.761024
[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 981–991.
[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. Libxsmm: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).
[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[19] Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455–500.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.
[23] M. Donald MacLaren. 1970. The art of computer programming, Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306–308.
[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).
[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2924–2932.
[26] Victor Y. Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676–694.
[27] Lawrence R. Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall, Inc. (1975), 777 p.
[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[30] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.
[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).
[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The Roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24–26.
[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. SIAM.
[35] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20–24.
[36] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium, 2016 IEEE International. IEEE, 801–811.
[37] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 854–865.
[38] Aleksandar Zlateski and H. Sebastian Seung. 2017. Compile-time optimized and statically scheduled ND convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity, we assume 2D and isotropic images and kernels, as well as computation being performed with 32–bit floating point numbers (4 bytes per number). Extension to non–isotropic and N–dimensional images and kernels, as well as lower precision floats, is straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non–isotropic kernels.

We follow the same notation as in Section 2, and for an arbitrary layer use Winograd convolution F(m², r²), Regular–FFT convolution F(m², r²), and Gauss–FFT convolution G(m², r²). Here m can take an arbitrary value, whereas r² has to equal the sizes of the kernels in the layer.

We proceed to present the details of all three methods and estimate the total number of operations and data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A1 Input Transform Stage

In Winograd and both FFT approaches, each of the B · C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has the size of t², with t = (m + r − 1), and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions. Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image will be implicitly zero–padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.

Number of required operations. Each tile Ibcn is transformed via Ĩbcn = B Ibcn Bᵀ, which is a composition of two matrix–matrix multiplications. Here Ibcn is the n–th tile of the c–th channel in batch b, and Ĩbcn is its transform. Both B and Ibcn have the size t², requiring a total of 4t³ operations (2t³ per matrix–matrix multiplication). Operations are on real numbers for the Winograd approach and on complex numbers for the FFT approaches.

The transforms can, however, be performed with much fewer operations. In the Winograd method, the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real–to–complex DFT transforms are performed, which require much fewer operations (µn² log n instead of O(n³)). Here the constant µ can vary significantly based on the size of the transform performed [13]. It takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.
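As a reference point for the 4t³ figure, the untuned version of a single tile transform is just two dense matrix products; this sketch (assuming generic transform matrices, not the optimized codelets counted in the lookup tables) shows the composition.

```python
# Naive reference for one tile transform: I~ = B @ I @ B^T.
# Each t x t by t x t product costs about 2*t^3 multiply-adds, so the
# composition costs about 4*t^3 operations (real for Winograd, complex for FFT).
import numpy as np

t = 8                              # tile size, t = m + r - 1
B = np.random.rand(t, t)           # transform matrix (method specific)
tile = np.random.rand(t, t)        # one input tile I_{b,c,n}

transformed = B @ tile @ B.T       # transformed tile
print(transformed.shape, 4 * t**3) # (8, 8) 2048
```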

Instead of using a closed–form expression that gives bounds on the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd, we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods, we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as FI(m², r²) for the Winograd, FI(m², r²) for Regular–FFT, and GI(m², r²) for Gauss–FFT. The pre–computed lookup tables are included in our open–sourced repository.

The total number of operations required for performing all the transforms for all three methods is given in Tbl. 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in–cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx² · 4 bytes – reading each of the B · C images of size x² only once. Each of the BCN transformed tiles is stored back to main memory. The size of each transformed tile will depend on the method.

In Winograd convolution, transformed tiles are real tensors and require a single 32–bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms will be complex conjugate–symmetric (along one of the dimensions). Only t⌈(t + 1)/2⌉ complex numbers need to be stored. The Regular–FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and the Gauss–FFT requires three reals per complex number, for a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.
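A small sketch of the Table 2 byte counts and the resulting arithmetic intensity of the input-transform stage; the per-tile FLOP count is a parameter (for example, FI(4², 3²) = 240 from Table 3), and the layer sizes below are only examples.

```python
# Sketch of input-transform FLOPs, data movement (DM) and AI per Table 2.
from math import ceil

def input_stage(B, C, x, m, r, flops_per_tile, method):
    t = m + r - 1
    N = ceil((x - r + 1) / m) ** 2                 # tiles per image
    if method == "winograd":
        tile_bytes = 4 * t * t                     # real tensor, 4 bytes/element
    elif method == "regular-fft":
        tile_bytes = 8 * t * ceil((t + 1) / 2)     # conjugate-symmetric complex
    else:                                          # "gauss-fft"
        tile_bytes = 12 * t * ceil((t + 1) / 2)    # three reals per complex number
    flops = B * C * N * flops_per_tile
    dm = 4 * B * C * x * x + B * C * N * tile_bytes  # read images once, store transforms
    return flops, dm, flops / dm

# Example layer: batch 64, 256 channels, 56x56 images, F(4^2, 3^2) Winograd.
print(input_stage(B=64, C=256, x=56, m=4, r=3, flops_per_tile=240, method="winograd"))
```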

A2 Kernel Transform Stage

In all methods, each of the C×C′ kernels (with size r×r) is transformed via W̃cc′ = G Wcc′ Gᵀ. The matrix G has size t×r.

Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero–padded to match the size of the input tiles. In practice, the transforms are computed with implicit zero–padding. As in the input stage, we have pre–computed the number of operations required for transforming a kernel in all three methods: FK(m², r²) for the Winograd and FK(m², r²) for Regular–FFT. For the Gauss–FFT method, GK(m², r²) = FK(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.

Table 2: FLOPS, DM, and AI for different stages of the Winograd (F(m², r²)), Regular–FFT (F(m², r²)), and Gauss–FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size of t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

FLOPS
  Input image transform:  Winograd: BCN·FI(m², r²)   |  Regular–FFT: BCN·FI(m², r²)   |  Gauss–FFT: BCN·GI(m², r²)
  Kernel transform:       Winograd: CC′·FK(m², r²)   |  Regular–FFT: CC′·FK(m², r²)   |  Gauss–FFT: CC′·GK(m², r²)
  Element–wise product:   Winograd: 2t²BNCC′          |  Regular–FFT: 8t⌈(t+1)/2⌉BNCC′  |  Gauss–FFT: 6t⌈(t+1)/2⌉BNCC′
  Output transform:       Winograd: BC′N·FO(m², r²)  |  Regular–FFT: BC′N·FO(m², r²)  |  Gauss–FFT: BC′N·GO(m², r²)

DM (bytes)
  Input image transform:  Winograd: 4BCx² + 4BCNt²            |  Regular–FFT: 4BCx² + 8BCNt⌈(t+1)/2⌉            |  Gauss–FFT: 4BCx² + 12BCNt⌈(t+1)/2⌉
  Kernel transform:       Winograd: 4CC′(r² + t²)              |  Regular–FFT: 4CC′(r² + 2t⌈(t+1)/2⌉)             |  Gauss–FFT: 4CC′(r² + 3t⌈(t+1)/2⌉)
  Element–wise product:   Winograd: 4t²BN(c + αc′)CC′/(cc′)    |  Regular–FFT: 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)    |  Gauss–FFT: 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
  Output transform:       Winograd: 4BC′N(t² + m²)             |  Regular–FFT: 4BC′N(2t⌈(t+1)/2⌉ + m²)            |  Gauss–FFT: 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
  Input image transform:  Winograd: NFI(m², r²)/(4x² + 4Nt²)   |  Regular–FFT: NFI(m², r²)/(4x² + 8Nt⌈(t+1)/2⌉)   |  Gauss–FFT: NGI(m², r²)/(4x² + 12Nt⌈(t+1)/2⌉)
  Kernel transform:       Winograd: FK(m², r²)/(4r² + 4t²)     |  Regular–FFT: FK(m², r²)/(4r² + 8t⌈(t+1)/2⌉)     |  Gauss–FFT: GK(m², r²)/(4r² + 12t⌈(t+1)/2⌉)
  Element–wise product:   Winograd: cc′/(2(c + αc′))           |  Regular–FFT: cc′/(c + αc′)                      |  Gauss–FFT: cc′/(2(c + αc′))
  Output transform:       Winograd: FO(m², r²)/(4t² + 4m²)     |  Regular–FFT: FO(m², r²)/(8t⌈(t+1)/2⌉ + 4m²)     |  Gauss–FFT: GO(m², r²)/(12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory, reading a total of 4CC′r² bytes, and store CC′ transformed kernels back to main memory. The transformed kernels have size t × t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉, and 12t⌈(t + 1)/2⌉ bytes per kernel for the Winograd, Regular–FFT, and Gauss–FFT methods, respectively.

The total number of operations, data movements, and AIs for transforming all kernels of a particular layer using any of the three methods are given in Tbl. 2.

A3 Element–wise Products

Having all transformed input and kernel tiles, the pre–transformed output tiles Ĩ′bc′n are computed via

    Ĩ′bc′n = Σ_{c=1}^{C}  Ĩbcn ⊙ W̃cc′                                     (11)

Here, all of the pre–transformed output tiles (Ĩ′bc′n), transformed input tiles Ĩbcn, and transformed kernels W̃cc′ have the same size t × t. Computing an element of the pre–transformed output tile at a location ℓ depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

    Ĩ′bc′n(ℓ) = Σ_{c=1}^{C}  Ĩbcn(ℓ) · W̃cc′(ℓ)                             (12)

Note that the equation above can be interpreted as a multiplication of a matrix Uℓ of size BN × C with a matrix Vℓ of size C × C′, resulting in a matrix Xℓ of size BN × C′. For layers of modern ConvNets, the values of BN are much larger than C, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location ℓ of Ĩ′bc′n.
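Shape-wise, the element–wise stage therefore amounts to one independent tall-and-skinny matrix product per transform location, as in this sketch (random data, Winograd-style real arithmetic; the FFT methods would only need the t⌈(t + 1)/2⌉ non-redundant locations).

```python
# Sketch: for each location l of the t x t transform, multiply
# U_l (BN x C) by V_l (C x C') to obtain X_l (BN x C').
import numpy as np

BN, C, Cp, t = 16 * 196, 32, 64, 6      # example sizes: B*N, C, C', tile size
U = np.random.rand(t * t, BN, C)        # transformed input tiles, grouped per location
V = np.random.rand(t * t, C, Cp)        # transformed kernels, grouped per location

X = np.einsum('lbc,lck->lbk', U, V)     # t*t independent tall-skinny matmuls
print(X.shape)                          # (t*t, BN, C') = (36, 3136, 64)
```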

A31 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT–based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre–transformed output tiles need to be computed. As both the transformed input tiles and transformed kernels are conjugate–symmetric, the pre–transformed output tiles will also be conjugate–symmetric.

In the Regular–FFT method, 4 real multiply–adds are required for performing a complex multiply–add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer. Gauss–FFT reduces complex matrix multiplications to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer.
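The reduction from 4 to 3 real multiplications comes from Gauss's complex-multiplication trick (Sec. 2.3); a minimal scalar version is sketched below, and applying it element-wise turns one complex matrix product into three real ones.

```python
# Gauss's trick: (a + bi)(c + di) with 3 real multiplications instead of 4,
# at the cost of a few extra additions.
def gauss_complex_mul(a, b, c, d):
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2            # (real part, imaginary part)

print(gauss_complex_mul(1.0, 2.0, 3.0, 4.0))   # (1+2i)(3+4i) = -5+10i -> (-5.0, 10.0)
```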

A32 Data Movement

In the Winograd method, t² real matrix multiplications X = U × V are performed (for the rest of the section we will omit the subscript ℓ for clarity). In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, with the same sizes. The Gauss–FFT method replaces one complex matrix multiplication with three real matrix multiplications; thus, a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U, and V may not fit in cache, in which case standard cache–blocking approaches are employed [14–16, 18, 22], and some data might have to be moved to and/or from the main memory multiple times.

In the standard cache–blocking technique, V (of size C×C′) is subdivided into equally sized matrices of size c × c′. To minimize transfers to and from main memory, a sub–matrix of V is kept in cache. A small sub–matrix of U, with size ρ×c, is fetched from memory, multiplied with the sub–matrix of V stored in–cache, and accumulated to the appropriate sub–matrix of X. Here ρ is a small number required for efficient in–cache computation [14–16, 18, 22].

This requires transferring a total of ρc numbers from main memory, and ρc′ numbers of X from, and then back to, the main memory – a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub–matrix multiplication produces the final result (no accumulation is necessary). A total of BNCC′/(ρcc′) such sub–matrix multiplications are performed, and they require BNCC′/(cc′) · (c + αc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications (X = U × V) are performed; thus a total of 4t²BN(c + αc′)CC′/(cc′) bytes are moved.

In the Regular–FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

The Gauss–FFT method replaces each of the t⌈(t + 1)/2⌉ complex matrix multiplications with 3 real matrix multiplications; a total of 12t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes need to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and c′ while allowing for in–cache computation.

As the values of t, B, C, C′, and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

    minimize_{c, c′}   (c + αc′)/(cc′)
    subject to   c  | C      (C is divisible by c)
                 c′ | C′     (C′ is divisible by c′)
                 4βcc′ ≤ (Cache Size)/2    (fits in half the cache)          (13)

Here α equals 1 when c = C and 2 when c < C; β is 1 for real–valued matrices and 2 for complex–valued ones. Half the cache is allotted to sub–matrices of V; this is typical practice to ensure enough space for sub–matrices of U and X.

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for optimal values of c and c′. This results in AIs of cc′/(2(c + αc′)) for the Winograd and Gauss–FFT methods, and cc′/(c + αc′) for the Regular–FFT method, which are equal to the AIs of real–matrix and complex–matrix multiplications, respectively [14–16, 18].
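A brute-force reading of Eqn. 13, together with the stage AIs it implies, might look like the following sketch (cache size in bytes; β = 1 for real and 2 for complex matrices; the channel counts are only examples).

```python
# Sketch: pick blocking sizes c, c' per Eqn. 13 and report the element-wise
# stage arithmetic intensities they imply.
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def best_blocking(C, Cp, cache_bytes, beta):
    best = None
    for c in divisors(C):
        for cp in divisors(Cp):
            if 4 * beta * c * cp > cache_bytes // 2:    # half the cache holds V
                continue
            alpha = 1 if c == C else 2
            cost = (c + alpha * cp) / (c * cp)          # objective of Eqn. 13
            if best is None or cost < best[0]:
                best = (cost, c, cp, alpha)
    _, c, cp, alpha = best
    ai_real = c * cp / (2 * (c + alpha * cp))           # Winograd / Gauss-FFT stage AI
    ai_complex = c * cp / (c + alpha * cp)              # Regular-FFT stage AI
    return c, cp, ai_real, ai_complex

# Example: C = C' = 256, 512 KB of cache, complex matrices (beta = 2).
print(best_blocking(256, 256, 512 * 1024, beta=2))
```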

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels as a function of cache size.

[Figure 4: Arithmetic intensities for the element–wise stage given various amounts of cache (x axis: cache size in KB; y axis: arithmetic intensity), for C and C′ of 64, 128, 256, 512, and 1024. In the Winograd and Gauss–FFT methods real matrix multiplications are performed, whereas in the Regular–FFT method complex matrix multiplications are performed.]

The AIs of both complex (used in the Regular–FFT method) and real matrix multiplication (used in Winograd and Gauss–FFT) increase with the cache size and the number of channels (C and C′). For a fixed cache size, the complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element–wise stage of the Regular–FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss–FFT convolutions when the cache size is a limiting resource.

A4 Output Transform

In the output transform stage, each of the BNC′ pre–transformed output tiles Ĩ′bc′n is transformed from the Winograd/FFT domain back to the spatial domain via I′bc′n = Aᵀ Ĩ′bc′n A. The matrix A has size t × m (t = m + r − 1), resulting in I′bc′n having a size of m × m. Here I′bc′n is the n–th (non–overlapping) tile of the c′–th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m×m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements, and AIs for transforming the output tiles of a particular layer using any of the three methods are given in Tbl. 2.
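For completeness, the inverse transform of one pre-transformed output tile, again as a naive two-product reference rather than the optimized codelets that the model actually counts:

```python
# Naive reference for one output transform: O = A^T @ P @ A, taking a
# t x t pre-transformed tile back to an m x m spatial output tile.
import numpy as np

m, r = 4, 3
t = m + r - 1
A = np.random.rand(t, m)           # inverse-transform matrix, size t x m
P = np.random.rand(t, t)           # pre-transformed output tile
out_tile = A.T @ P @ A             # final m x m tile of the layer output
print(out_tile.shape)              # (4, 4)
```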

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular–FFT method.

In Tbl. 7 and Tbl. 8 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss–FFT method.

Table 3: FLOPS required for performing Winograd transforms of a single tile or kernel.

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

F(2², r²) | 12 5 5 | 32 49 42 | 180 198 154 | 240 231 168 | 742 728 504 | 704 645 430
F(3², r²) | 32 24 28 | 180 128 128 | 240 170 153 | 742 552 460 | 704 518 407 | - - -
F(4², r²) | 180 70 90 | 240 135 150 | 742 396 396 | 704 403 372 | - - - | - - -
F(5², r²) | 240 72 99 | 742 260 312 | 704 300 325 | - - - | - - - | - - -
F(6², r²) | 742 144 208 | 704 242 308 | - - - | - - - | - - - | - - -
F(7², r²) | 704 130 195 | - - - | - - - | - - - | - - - | - - -

Table 4: AIs of Winograd transforms for a single tile or kernel.

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

F(2², r²) | 0.17 0.10 0.10 | 0.25 0.49 0.53 | 0.90 1.21 1.33 | 0.83 0.95 1.05 | 1.89 2.14 2.38 | 1.38 1.43 1.58
F(3², r²) | 0.25 0.30 0.28 | 0.90 0.94 0.94 | 0.83 0.82 0.85 | 1.89 1.86 1.98 | 1.38 1.29 1.39 | - - -
F(4², r²) | 0.90 0.60 0.55 | 0.83 0.75 0.72 | 1.89 1.52 1.52 | 1.38 1.13 1.16 | - - - | - - -
F(5², r²) | 0.83 0.45 0.41 | 1.89 1.12 1.05 | 1.38 0.94 0.91 | - - - | - - - | - - -
F(6², r²) | 1.89 0.68 0.61 | 1.38 0.83 0.77 | - - - | - - - | - - - | - - -
F(7², r²) | 1.38 0.48 0.43 | - - - | - - - | - - - | - - - | - - -

Table 5: FLOPS required for performing Regular–FFT transforms of a single tile or kernel.

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -

Table 6: AIs of Regular–FFT transforms for a single tile or kernel.

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss–FFT vs. the Regular–FFT methods.

D Comparing to State–of–the–art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and other state–of–the–art libraries. The benchmarks were performed on AVX512–capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus, the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.

Table 7: FLOPS required for performing Gauss–FFT transforms of a single tile or kernel.

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

[Figure 5: Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs. the Gauss–FFT method on VGG and AlexNet. One panel per layer (VGG 1.2–5.1/5.2, AlexNet 2–5), each labeled Gauss–FFT vs Regular–FFT; the x axis is the system's CMR (FLOPs/Byte), the y axis is the speedup.]

Table 8: AIs of Gauss–FFT transforms for a single tile or kernel.

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

[Figure 6: Running time of our implementations compared to state–of–the–art on AVX512-capable systems. One panel per layer (VGG 1.2–5.1/5.2, AlexNet 2–5); the y axis is time in milliseconds. Bars: Our Winograd (Jia et al.), Our Regular–FFT, Our Gauss–FFT, MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd. Systems: Xeon Phi 7210 (409.6 GB/s, CMR of 11), i9-7900X (96 GB/s, CMR of 22), and i9-7900X (68.3 GB/s, CMR of 31).]

[Figure 7: Running time of our implementations compared to state–of–the–art on AVX512-capable systems, continued. Same layout and legend as Figure 6. Systems: Xeon Phi 7210 using 48 cores (102.4 GB/s, CMR of 33), Xeon Phi 7210 (102.4 GB/s, CMR of 39.11), and i9-7900X (51.2 GB/s, CMR of 41.25).]

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 Winogradndash and FFTndash Based Convolutions
    • 22 Winogradndash and FFTndash Based Convolution Layers
    • 23 Gauss Multiplication of Complex Numbers
      • 3 Implementations
      • 4 Performance Comparisons
      • 5 Performance Analysis
        • 51 Relative Performances
        • 52 The Accuracy of the Theoretical Estimates
        • 53 Discussions
          • 6 Conclusions
          • References
          • A Detailed Analysis
            • A1 Input Transform Stage
            • A2 Kernel Transform Stage
            • A3 Elementndashwise Products
            • A4 Output Transform
              • B Model Lookup Tables
              • C Additional Model Verification
              • D Comparing to Statendashofndashthendashart
Page 8: FFT Convolutions are Faster than Winograd on Modern CPUs ... · Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One

faster than Wingorad ones for most commonly used Con-vNet layers

Our analysis using the Roofline model shows that whetherthe Winogradndash or FFTndashbased approach is faster depends onboth the layer and target hardware However with the ten-dency of increasing computendashtondashmemory ratio of systemswith modern CPUs the FFTndashbased methods tend to be fasterthan Winograd

While we primarily focused onmodern multindash and manyndashcore CPUs which generally have larger caches but smallermemory bandwidths than modern GPUs we believe that ourperformance analysis can be applied to GPUs Future workmight include implementing and benchmarking efficientGPUndashbased implementations and validating performanceanalysis based on the Roofline model

References[1] 2016 FALCON Library Fast Image Convolution in Neural Networks

on Intel Architecture httpscolfaxresearchcomfalcon-library(2016)

[2] 2016 Intel(R) Math Kernel Library for Deep Neural Networks httpsgithubcom01orgmkl-dnn (2016)

[3] 2018 N-Dimensional Winogradndashbased convolution frameworkhttpsbitbucketorgpoozhond-winograd (2018)

[4] Accessed 08-15-2018 Imagenet Winners Benchmarking httpsgithubcomsoumithconvnet-benchmarks (Accessed 08-15-2018)

[5] Accessed 08-15-2018 Intelreg Nervana reference deep learning frame-work httpsgithubcomNervanaSystemsneon (Accessed 08-15-2018)

[6] Accessed 08-15-2018 LIBXSMM httpsgithubcomhfplibxsmm(Accessed 08-15-2018)

[7] Accessed 08-15-2018 Wincnn httpsgithubcomandravinwincnn(Accessed 08-15-2018)

[8] Accessed 08-15-2018 Winograd and FFT Convolution httpsbitbucketorgjiazhentimwinograd-fft-convsrcmaster (Accessed08-15-2018)

[9] JeremyAppleyard Accessed 08-15-2018 Optimizing Recurrent NeuralNetworks in cuDNN 5 httpsdevblogsnvidiacomparallelforalloptimizing-recurrent-neural-networks-cudnn-5 (Accessed 08-15-2018)

[10] David Budden Alexander Matveev Shibani Santurkar Shraman RayChaudhuri and Nir Shavit 2016 Deep Tensor Convolution on Multi-cores arXiv preprint arXiv161106565 (2016)

[11] Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan Co-hen John Tran Bryan Catanzaro and Evan Shelhamer 2014 cudnnEfficient primitives for deep learning arXiv preprint arXiv14100759(2014)

[12] Jia Deng Wei Dong Richard Socher Li-Jia Li Kai Li and Li Fei-Fei2009 Imagenet A large-scale hierarchical image database In ComputerVision and Pattern Recognition 2009 CVPR 2009 IEEE Conference onIEEE 248ndash255

[13] Matteo Frigo and Steven G Johnson 1998 FFTW An adaptive softwarearchitecture for the FFT In Acoustics Speech and Signal Processing1998 Proceedings of the 1998 IEEE International Conference on Vol 3IEEE 1381ndash1384

[14] Dennis Gannon William Jalby and Kyle Gallivan 1988 Strate-gies for Cache and Local Memory Management by Global ProgramTransformation In Proceedings of the 1st International Conference onSupercomputing Springer-Verlag London UK UK 229ndash254 httpdlacmorgcitationcfmid=647970761024

[15] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: accelerating small matrix multiplications by runtime code generation. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 981–991.

[16] Alexander Heinecke, Hans Pabst, and Greg Henry. 2015. LIBXSMM: A high performance library for small matrix multiplications. Poster and Extended Abstract Presented at SC (2015).

[17] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.

[18] Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

[19] Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51, 3 (2009), 455–500.

[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.

[21] Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.

[22] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2015. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 76.

[23] M. Donald MacLaren. 1970. The art of computer programming. Volume 2: Seminumerical algorithms (Donald E. Knuth). SIAM Rev. 12, 2 (1970), 306–308.

[24] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).

[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2924–2932.

[26] Victor Y. Pan. 2016. How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 2 (2016), 676–694.

[27] Lawrence R. Rabiner and Bernard Gold. 1975. Theory and application of digital signal processing. Prentice-Hall, Englewood Cliffs, NJ (1975).

[28] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.

[30] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, Vol. 1. 4.

[31] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).

[32] Kevin Vincent, Kevin Stephano, Michael Frumkin, Boris Ginsburg, and Julien Demouth. 2017. On Improving the Numerical Stability of Winograd Convolutions. (2017).

[33] Samuel Williams, David Patterson, Leonid Oliker, John Shalf, and Katherine Yelick. 2008. The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. In Hot Chips, Vol. 20. 24–26.

[34] Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. SIAM.

[35] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News 23, 1 (1995), 20–24.

[36] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines. In Parallel and Distributed Processing Symposium, 2016 IEEE International. IEEE, 801–811.

[37] Aleksandar Zlateski, Kisuk Lee, and H. Sebastian Seung. 2016. ZNNi: maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 854–865.

[38] Aleksandar Zlateski and H. Sebastian Seung. 2017. Compile-time optimized and statically scheduled ND convnet primitives for multi-core and many-core (Xeon Phi) CPUs. In Proceedings of the International Conference on Supercomputing. ACM, 8.

A Detailed Analysis

For simplicity, we assume 2D and isotropic images and kernels, as well as computation performed with 32-bit floating point numbers (4 bytes per number). Extension to non-isotropic and N-dimensional images and kernels, as well as to lower precision floats, is straightforward. The benchmarked implementations described in Section 3 support an arbitrary number of dimensions as well as non-isotropic kernels.

We follow the same notation as in Section 2, and for an arbitrary layer use Winograd convolution F(m², r²), Regular-FFT convolution F(m², r²), and Gauss-FFT convolution G(m², r²). Here m can take an arbitrary value, whereas r² has to equal the size of the kernels in the layer.

We proceed to present the details of all three methods and estimate the total number of operations and data movement required in each stage, from which we also calculate the arithmetic intensity of each of the three methods.

A.1 Input Transform Stage

In Winograd and both FFT approaches, each of the B·C images (B batches, each with C input channels) is divided into overlapping tiles, which are then transformed. Each tile has the size of t², with t = (m + r − 1), and there is an overlap of r − 1 pixels between adjacent tiles along both dimensions. Each image of size x × x will thus be divided into N = ⌈(x − r + 1)/m⌉² tiles. If necessary, the image will be implicitly zero-padded. This yields a total of BC⌈(x − r + 1)/m⌉² = BCN image tiles.
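As a concrete check of the formulas above, the short Python sketch below computes t, N and the total tile count; the layer dimensions in the example are hypothetical and chosen only for illustration.

```python
import math

def tile_counts(x, m, r, B, C):
    """Tile geometry of the input transform stage: t = m + r - 1,
    N = ceil((x - r + 1) / m)^2 tiles per image, and B*C*N tiles in total."""
    t = m + r - 1
    N = math.ceil((x - r + 1) / m) ** 2
    return t, N, B * C * N

# Example: a 56x56 image, 3x3 kernels, output tiles of size m = 4
t, N, total = tile_counts(x=56, m=4, r=3, B=64, C=256)
print(t, N, total)   # 6 196 3211264
```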

Number of required operations. Each tile I_{b,c,n} is transformed via Ĩ_{b,c,n} = B I_{b,c,n} Bᵀ, which is a composition of two matrix-matrix multiplications. Here I_{b,c,n} is the n-th tile of the c-th channel in batch b, and Ĩ_{b,c,n} is its transform. Both B and I_{b,c,n} have the size t², requiring a total of 4t³ operations (2t³ per matrix-matrix multiplication). Operations are on real numbers for the Winograd approach, and on complex numbers for the FFT approaches.

The transforms can, however, be performed with much fewer operations. In the Winograd method, the matrices B are sparse and contain pairs of columns with similar numbers, allowing for reuse of many intermediate results. In the FFT methods, instead of matrix multiplications, 2D real-to-complex DFT transforms are performed, which require much fewer operations (μn² log n instead of O(n³)). Here, the constant μ can vary significantly based on the size of the transform performed [13]. It takes small values when the size of the transform is a power of two, and larger values when the transform size contains large prime factors.

Instead of using a closed-form expression that gives bounds on the number of operations required, we counted the number of operations in real, optimized implementations and stored them in lookup tables. For Winograd, we used the Winograd matrix generator [7], as proposed in [10, 18], and further reduced the number of operations using the simple optimizer provided by [18]. For the FFT methods, we used the FFT codelet generator "genfft" from the FFTW library [13]. We denote the number of operations required for transforming an image tile as F_I(m², r²) for the Winograd, F_I(m², r²) for the Regular-FFT, and G_I(m², r²) for the Gauss-FFT method. The pre-computed lookup tables are included in our open-sourced repository.

The total number of operations required for performing all the transforms for all three methods are given in Tbl. 2.

Data movement. The tile sizes are relatively small, much smaller than the available cache; thus, once a tile is fetched from main memory, all the computation is done in-cache. Additionally, the overlapping regions between tiles need to be fetched from memory only once, as they can be stored in cache and reused for adjacent tiles.

The total data that has to be moved from main memory to cache is thus BCx²·4 bytes, reading each of the B·C images of size x² only once. Each of the BCN transformed tiles is stored back to main memory. The size of each transformed tile will depend on the method.

In Winograd convolution, transformed tiles are real tensors and require a single 32-bit float per element, for a total of 4t² bytes.

The FFT methods perform 2D FFT transforms of each tile. As the FFTs are performed on real tensors, the resulting transforms will be complex conjugate-symmetric (along one of the dimensions). Only t⌈(t + 1)/2⌉ complex numbers need to be stored. The Regular-FFT requires two real numbers per complex number, for a total of 8t⌈(t + 1)/2⌉ bytes per tile, and the Gauss-FFT requires three reals per complex number, for a total of 12t⌈(t + 1)/2⌉ bytes per tile.

The total amount of data movement for the three methods, as well as their arithmetic intensities (AIs), is shown in Tbl. 2.
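The per-tile storage sizes stated above follow directly from t; the snippet below (a sketch, using the same 4-byte floats assumed throughout this appendix) evaluates them for a given transform size.

```python
import math

def transformed_tile_bytes(t):
    """Bytes needed to store one transformed t x t tile.

    Winograd keeps t*t reals; the FFT methods keep only t*ceil((t+1)/2)
    complex entries thanks to conjugate symmetry: 2 reals each for
    Regular-FFT and 3 reals each for Gauss-FFT, at 4 bytes per real."""
    half = t * math.ceil((t + 1) / 2)
    return {"winograd": 4 * t * t,
            "regular_fft": 8 * half,
            "gauss_fft": 12 * half}

print(transformed_tile_bytes(6))  # {'winograd': 144, 'regular_fft': 192, 'gauss_fft': 288}
```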

A.2 Kernel Transform Stage

In all methods, each of the C×C′ kernels (with size r×r) is transformed via W̃_{c,c′} = G W_{c,c′} Gᵀ. The matrix G has size t×r.

Computing the transform of a kernel is an equivalent procedure to performing the input transform of a kernel zero-padded to match the size of the input tiles. In practice, the transforms are computed with implicit zero-padding. As in the input stage, we have pre-computed the number of operations required for transforming a kernel in all three methods: F_K(m², r²) for the Winograd and F_K(m², r²) for the Regular-FFT method. For the Gauss-FFT method, G_K(m², r²) = F_K(m², r²) + 2t⌈(t + 1)/2⌉, as we need two extra operations per complex number, as described in Sec. 2.3.


Table 2: FLOPS, DM and AI for different stages of the Winograd (F(m², r²)), Regular-FFT (F(m², r²)) and Gauss-FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r − 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

FLOPS
  Stage                 | Winograd              | Regular-FFT                   | Gauss-FFT
  Input image transform | BCN F_I(m², r²)       | BCN F_I(m², r²)               | BCN G_I(m², r²)
  Kernel transform      | CC′ F_K(m², r²)       | CC′ F_K(m², r²)               | CC′ G_K(m², r²)
  Element-wise product  | 2t²BNCC′              | 8t⌈(t+1)/2⌉BNCC′              | 6t⌈(t+1)/2⌉BNCC′
  Output transform      | BC′N F_O(m², r²)      | BC′N F_O(m², r²)              | BC′N G_O(m², r²)

DM (bytes)
  Stage                 | Winograd                  | Regular-FFT                      | Gauss-FFT
  Input image transform | 4BCx² + 4BCNt²            | 4BCx² + 8BCNt⌈(t+1)/2⌉           | 4BCx² + 12BCNt⌈(t+1)/2⌉
  Kernel transform      | 4CC′(r² + t²)             | 4CC′(r² + 2t⌈(t+1)/2⌉)           | 4CC′(r² + 3t⌈(t+1)/2⌉)
  Element-wise product  | 4t²BN(c + αc′)CC′/(cc′)   | 8t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)  | 12t⌈(t+1)/2⌉BN(c + αc′)CC′/(cc′)
  Output transform      | 4BC′N(t² + m²)            | 4BC′N(2t⌈(t+1)/2⌉ + m²)          | 4BC′N(3t⌈(t+1)/2⌉ + m²)

AI
  Stage                 | Winograd                      | Regular-FFT                           | Gauss-FFT
  Input image transform | N F_I(m², r²) / (4x² + 4Nt²)  | N F_I(m², r²) / (4x² + 8Nt⌈(t+1)/2⌉)  | N G_I(m², r²) / (4x² + 12Nt⌈(t+1)/2⌉)
  Kernel transform      | F_K(m², r²) / (4r² + 4t²)     | F_K(m², r²) / (4r² + 8t⌈(t+1)/2⌉)     | G_K(m², r²) / (4r² + 12t⌈(t+1)/2⌉)
  Element-wise product  | cc′ / (2(c + αc′))            | cc′ / (c + αc′)                       | cc′ / (2(c + αc′))
  Output transform      | F_O(m², r²) / (4t² + 4m²)     | F_O(m², r²) / (8t⌈(t+1)/2⌉ + 4m²)     | G_O(m², r²) / (12t⌈(t+1)/2⌉ + 4m²)

All three methods need to fetch all kernel data from memory, reading a total of 4CC′r² bytes, and store CC′ transformed kernels back to main memory. The transformed kernels have size t×t and are stored in the same fashion as the transformed input tiles, requiring a total of 4t², 8t⌈(t + 1)/2⌉ and 12t⌈(t + 1)/2⌉ bytes per kernel for the Winograd, Regular-FFT and Gauss-FFT methods, respectively.

The total number of operations, data movements and AIs for transforming all kernels of a particular layer using any of the three methods are given in Tbl. 2.

A.3 Element-wise Products

Having all transformed input and kernel tiles, the pre-transformed output tiles Ĩ′_{b,c′,n} are computed via

    Ĩ′_{b,c′,n} = Σ_{c=1}^{C} Ĩ_{b,c,n} ⊙ W̃_{c,c′}        (11)

Here, all of the pre-transformed output tiles Ĩ′_{b,c′,n}, transformed input tiles Ĩ_{b,c,n} and transformed kernels W̃_{c,c′} have the same size t×t. Computing an element of the pre-transformed output tile at a location e depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

    Ĩ′_{b,c′,n}(e) = Σ_{c=1}^{C} Ĩ_{b,c,n}(e) · W̃_{c,c′}(e)        (12)

Note that the equation above can be interpreted as a multiplication of a matrix U_e of size BN×C with a matrix V_e of size C×C′, resulting in a matrix X_e of size BN×C′. For layers of modern ConvNets, values of BN are much larger than C, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location of Ĩ′_{b,c′,n}.
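The interpretation above maps directly onto a batch of independent matrix multiplications, one per stored element location. The NumPy sketch below illustrates this; the dimensions are hypothetical and the layout (one dense array per operand) is chosen for clarity rather than performance.

```python
import numpy as np

# Hypothetical sizes: E element locations per tile (t*t for Winograd,
# t*ceil((t+1)/2) for the FFT methods), BN tiles, C input and C' output channels.
E, BN, C, Cp = 36, 2048, 64, 128

U = np.random.rand(E, BN, C)   # transformed input tiles, gathered per location e
V = np.random.rand(E, C, Cp)   # transformed kernels, gathered per location e

# One (BN x C) @ (C x C') product per element location e, as in Eqn. (12).
X = np.einsum('ebc,ecd->ebd', U, V)   # pre-transformed output tiles
print(X.shape)                        # (36, 2048, 128)
```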

A.3.1 Operations Required

In the Winograd method, t² real matrices are multiplied, requiring a total of 2t²BNCC′ operations for the whole layer, where N is the number of tiles per image.

For the FFT-based methods, complex matrices are multiplied. However, only t⌈(t + 1)/2⌉ locations of the pre-transformed output tiles need to be computed. As both the transformed input tiles and transformed kernels are conjugate-symmetric, the pre-transformed output tiles will also be conjugate-symmetric.

In the Regular-FFT method, 4 real multiply-adds are required for performing a complex multiply-add. This gives a total of 8t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer. Gauss-FFT reduces complex matrix multiplications to 3 real matrix multiplications of matrices of the same size, totaling 6t⌈(t + 1)/2⌉BNCC′ FLOPs required for the whole layer.
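These three whole-layer FLOP counts differ only in the per-location cost, which the sketch below makes explicit (again with the hypothetical layer dimensions used earlier in this appendix).

```python
import math

def elementwise_flops(t, B, N, C, Cp):
    """Whole-layer FLOP counts of the element-wise stage (cf. Tbl. 2)."""
    half = t * math.ceil((t + 1) / 2)      # stored locations for the FFT methods
    return {"winograd":    2 * t * t * B * N * C * Cp,
            "regular_fft": 8 * half * B * N * C * Cp,
            "gauss_fft":   6 * half * B * N * C * Cp}

# Example: t = 6, N = 196 tiles per image, B = 64, C = C' = 256
print(elementwise_flops(t=6, B=64, N=196, C=256, Cp=256))
```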

A.3.2 Data Movement

In the Winograd method, t² real matrix multiplications X = U × V are performed (for the rest of the section we will omit the subscript e for clarity). In the Regular-FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications of the same sizes are performed. The Gauss-FFT method replaces each complex matrix multiplication with three real matrix multiplications; thus, a total of 3t⌈(t + 1)/2⌉ real matrix multiplications are performed.

For modern ConvNets, the matrices X, U and V may not fit in cache, in which case standard cache-blocking approaches are employed [14-16, 18, 22], which might require some data to be moved to and/or from the main memory multiple times.

In the standard cache-blocking technique, V (of size C×C′) is subdivided into equally sized sub-matrices of size c × c′. To minimize transfers to and from main memory, a sub-matrix of V is kept in cache. A small sub-matrix of U with size ρ×c is fetched from memory, multiplied with the sub-matrix of V stored in-cache, and accumulated to the appropriate sub-matrix of X. Here, ρ is a small number required for efficient in-cache computation [14-16, 18, 22].
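A minimal sketch of this blocking scheme is shown below; it is only meant to expose the loop structure and data reuse described above (the block sizes c, c′ and the panel height ρ are free parameters, and no vectorization or parallelism is attempted).

```python
import numpy as np

def blocked_matmul(U, V, c, cp, rho=16):
    """Cache-blocked X = U @ V: a c x c' block of V is (conceptually) kept
    in cache while rho x c panels of U stream in and accumulate into X."""
    BN, C = U.shape
    _, Cp = V.shape
    X = np.zeros((BN, Cp))
    for j in range(0, Cp, cp):                 # blocks of columns of V and X
        for k in range(0, C, c):               # blocks along the accumulation dimension
            V_block = V[k:k + c, j:j + cp]     # sub-matrix of V held in cache
            for i in range(0, BN, rho):        # stream rho x c panels of U from memory
                X[i:i + rho, j:j + cp] += U[i:i + rho, k:k + c] @ V_block
    return X
```

When c = C, the accumulation over k finishes in a single pass, so each panel of X is written only once; this corresponds to the α = 1 case in the data-movement analysis below.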

This requires transferring a total of ρc numbers from main memory, and ρc′ numbers of X from, and then back to, the main memory; a total of ρ(c + 2c′) numbers. In the special case when c = C, only ρ(c + c′) numbers need to be moved, as each sub-matrix multiplication produces the final result (no accumulation is necessary).

A total of BNCC′/(ρcc′) such sub-matrix multiplications are performed, and they require BNCC′(c + αc′)/(cc′) numbers to be moved, with α being 1 when c = C and 2 when c < C.

In the Winograd method, t² real matrix multiplications (X = U × V) are performed, thus requiring a total of 4t²BN(c + αc′)CC′/(cc′) bytes to be moved.

In the Regular-FFT method, t⌈(t + 1)/2⌉ complex matrix multiplications are performed, requiring a total of 8t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes to be moved.

The Gauss-FFT method replaces each of the t⌈(t + 1)/2⌉ complex matrix multiplications with 3 real matrix multiplications, so a total of 12t⌈(t + 1)/2⌉BN(c + αc′)CC′/(cc′) bytes need to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for c and c′, while allowing for in-cache computation.

As the values of t, B, C, C′ and N are constant, the optimal values of c and c′ can be chosen by solving the following optimization problem:

    minimize over c, c′:   (c + αc′) / (cc′)
    subject to:            c | C        (C is divisible by c)
                           c′ | C′      (C′ is divisible by c′)
                           4βcc′ ≤ (cache size) / 2    (fits in half the cache)        (13)

Here, α equals 1 when c = C, and 2 when c < C; β is 1 for real-valued matrices and 2 for complex-valued ones. Half the cache is allotted to the sub-matrices of V; this is typical practice, to ensure enough space for the sub-matrices of U and X.
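Since c and c′ range only over the divisors of C and C′, the problem in Eqn. 13 can be solved by exhaustive search. The sketch below does exactly that; it is an illustration of the model, not the released implementation, and the example cache size is arbitrary.

```python
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def best_blocking(C, Cp, cache_bytes, complex_matrices=False):
    """Brute-force solution of Eqn. (13): minimize (c + alpha*c')/(c*c')
    over divisors c | C and c' | C', with V's block fitting in half the cache."""
    beta = 2 if complex_matrices else 1
    best = None
    for c in divisors(C):
        for cp in divisors(Cp):
            if 4 * beta * c * cp > cache_bytes / 2:
                continue                      # block of V does not fit in half the cache
            alpha = 1 if c == C else 2
            cost = (c + alpha * cp) / (c * cp)
            if best is None or cost < best[0]:
                best = (cost, c, cp)
    return best                               # (minimal data moved per unit of work, c, c')

# Example: C = C' = 256 channels, 512 KB of cache, complex matrices (Regular-FFT)
print(best_blocking(256, 256, 512 * 1024, complex_matrices=True))
```

At the optimal (c, c′), the element-wise AI of Tbl. 2 is cc′/(c + αc′) for complex matrices and cc′/(2(c + αc′)) for real matrices.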

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement, for the optimal values of c and c′. This results in AIs of cc′/(2(c + αc′)) for the Winograd and Gauss-FFT methods, and cc′/(c + αc′) for the Regular-FFT method, which are equal to the AIs of real-matrix and complex-matrix multiplications, respectively [14-16, 18].

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels, as a function of cache size.

[Figure 4: Arithmetic intensities for the element-wise stage given various amounts of cache (x-axis: cache size in KB; left panel: real matrices; right panel: complex matrices; one curve per number of channels C and C′ in {64, 128, 256, 512, 1024}). In the Winograd and Gauss-FFT methods real matrix multiplications are performed, whereas in the Regular-FFT complex matrix multiplications are performed.]

The AIs of both complex (used in the Regular-FFT method) and real matrix multiplication (used in Winograd and Gauss-FFT) increase with the cache size and with the number of channels (C and C′). For a fixed cache size, the complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element-wise stage of the Regular-FFT method may achieve better hardware utilization than the ones of Winograd or Gauss-FFT convolution when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the BNC′ pre-transformed output tiles Ĩ′_{b,c′,n} is transformed from the Winograd/FFT domain back to the spatial domain via I′_{b,c′,n} = Aᵀ Ĩ′_{b,c′,n} A. The matrix A has size t×m (t = m + r − 1), resulting in I′_{b,c′,n} having a size of m×m. Here, I′_{b,c′,n} is the n-th (non-overlapping) tile of the c′-th image in batch b of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of m×m elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements and AIs for transforming the output tiles of a particular layer using any of the three methods are given in Tbl. 2.

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular-FFT method.
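To illustrate how a lookup-table entry is combined with the formulas of Tbl. 2, the sketch below computes the layer-level AI of the input transform stage from a per-tile operation count F_I; the example plugs in F_I = 300, the Tbl. 5 entry for Regular-FFT F(4², 3²), together with the hypothetical layer geometry used earlier (x = 56, t = 6, N = 196).

```python
import math

def input_transform_ai(F_I, x, N, t, method="regular_fft"):
    """Layer-level AI of the input transform stage, per the formulas in Tbl. 2."""
    half = t * math.ceil((t + 1) / 2)
    dm = {"winograd":    4 * x * x + 4 * N * t * t,
          "regular_fft": 4 * x * x + 8 * N * half,
          "gauss_fft":   4 * x * x + 12 * N * half}[method]   # bytes moved per image
    return N * F_I / dm

print(round(input_transform_ai(F_I=300, x=56, N=196, t=6), 2))   # ~1.17 FLOPs per byte
```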

Table 3: FLOPS required for performing Winograd transforms of a single tile or kernel.

  Method     | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)
  F(2², r²)  | 12 5 5             | 32 49 42           | 180 198 154        | 240 231 168        | 742 728 504        | 704 645 430
  F(3², r²)  | 32 24 28           | 180 128 128        | 240 170 153        | 742 552 460        | 704 518 407        | -
  F(4², r²)  | 180 70 90          | 240 135 150        | 742 396 396        | 704 403 372        | -                  | -
  F(5², r²)  | 240 72 99          | 742 260 312        | 704 300 325        | -                  | -                  | -
  F(6², r²)  | 742 144 208        | 704 242 308        | -                  | -                  | -                  | -
  F(7², r²)  | 704 130 195        | -                  | -                  | -                  | -                  | -

Table 4: AIs of Winograd transforms for a single tile or kernel.

  Method     | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)
  F(2², r²)  | 0.17 0.10 0.10     | 0.25 0.49 0.53     | 0.90 1.21 1.33     | 0.83 0.95 1.05     | 1.89 2.14 2.38     | 1.38 1.43 1.58
  F(3², r²)  | 0.25 0.30 0.28     | 0.90 0.94 0.94     | 0.83 0.82 0.85     | 1.89 1.86 1.98     | 1.38 1.29 1.39     | -
  F(4², r²)  | 0.90 0.60 0.55     | 0.83 0.75 0.72     | 1.89 1.52 1.52     | 1.38 1.13 1.16     | -                  | -
  F(5², r²)  | 0.83 0.45 0.41     | 1.89 1.12 1.05     | 1.38 0.94 0.91     | -                  | -                  | -
  F(6², r²)  | 1.89 0.68 0.61     | 1.38 0.83 0.77     | -                  | -                  | -                  | -
  F(7², r²)  | 1.38 0.48 0.43     | -                  | -                  | -                  | -                  | -

Table 5: FLOPS required for performing Regular-FFT transforms of a single tile or kernel.

  Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -


Table 6: AIs of Regular-FFT transforms for a single tile or kernel.

  Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

In Tbl. 7 and Tbl. 8 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss-FFT method.

C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss-FFT vs. the Regular-FFT method.

D Comparing to State-of-the-art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and with other state-of-the-art libraries. The benchmarks were performed on AVX512-capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus, the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.


Table 7: FLOPS required for performing Gauss-FFT transforms of a single tile or kernel.

  Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

[Figure 5: Theoretical estimates of our model and empirical measurements for the speedup of the Regular-FFT vs. the Gauss-FFT method, for VGG and AlexNet (one panel per layer: VGG 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2, 5.1/5.2 and AlexNet 2-5; y-axis: speedup).]

Table 8: AIs of Gauss-FFT transforms for a single tile or kernel.

  Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

[Figure 6: Running time of our implementations compared to state-of-the-art on AVX512-capable systems. Each panel shows the time in milliseconds per unique VGG and AlexNet layer; sub-figures correspond to different systems (Xeon Phi 7210 and i9-7900X at different memory bandwidths and compute-to-memory ratios); series: Our Winograd, Our Regular-FFT, Our Gauss-FFT, MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd.]

[Figure 7: Running time of our implementations compared to state-of-the-art on AVX512-capable systems, continued. Same layout and series as Fig. 6, for additional Xeon Phi 7210 and i9-7900X configurations.]


050

075

100

125

150

0 20 40 60

VGG 12

050

075

100

125

150

0 20 40 60

VGG 21

050

075

100

125

150

0 20 40 60

VGG 22

050

075

100

125

150

0 20 40 60

VGG 31

050

075

100

125

150

0 20 40 60

VGG 32

050

075

100

125

150

0 20 40 60

VGG 41

050

075

100

125

150

0 20 40 60

VGG 42

050

075

100

125

150

0 20 40 60

VGG 5152

050

075

100

125

150

0 20 40 60

AlexNet 2

050

075

100

125

150

0 20 40 60

AlexNet 3

050

075

100

125

150

0 20 40 60

AlexNet 4

050

075

100

125

150

0 20 40 60

AlexNet 5

GaussminusFFT vs RegularminusFFT

Spe

edup

Figure 5 Theoretical estimates of our model and empirical measurements for the speedup of the Regularndash vs the GaussndashFFT method VGGand AlexNet

14

Table 8 AIs of GaussndashFFT transforms for a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

15

139

44

698

5

617

2

353

7

363

1

347

4

0

50

100

150VGG 12

703

0

335

0

251

6

166

3

164

3

164

8

0

20

40

60

VGG 21

135

67

680

4364

3

264

2

271

2

272

9

0

50

100

VGG 22

736

3

323

8

156

0

126

4

137

9

121

6

0

20

40

60

80VGG 31

111

98

658

8

335

9

227

3

255

1

215

8

0

30

60

90

120VGG 32

760

6

313

0

155

3

123

0

131

2

113

1

0

20

40

60

80

VGG 41

128

10

626

9

284

9

231

9

257

7

215

4

0

50

100

VGG 42

669

9

184

9

124

2

824

767

748

0

20

40

60

VGG 5152

na

145

8

na

386

384

101

9

0

5

10

15

AlexNet 2

217

4

954

687

364

351

427

0

5

10

15

20

AlexNet 3

369

9

129

5

102

6

493

513

557

0

10

20

30

40AlexNet 4

315

2

874

853

327

353

370

0

10

20

30

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 4096 GBs CMR of 11T

ime

(mill

isec

onds

)

239

57

291

76

221

71145

64

116

65

173

13

0

100

200

300

VGG 12

734

4970

1

810

2

614

1

512

3

679

6

0

25

50

75

100

VGG 21

140

67

232

48

144

48843

2

817

5

949

0

0

50

100

150

200

250VGG 22

473

4

109

89

767

5

390

9

390

8

401

7

0

30

60

90

120VGG 31

824

0

238

56

143

91644

1

684

6

652

4

0

50

100

150

200

250

VGG 32

389

6

136

04

672

4

362

4

378

2

305

8

0

50

100

VGG 41

410

9

163

66

883

3662

2

715

0

583

5

0

50

100

150

VGG 42

295

4

693

2

405

9

206

3

221

9

205

0

0

20

40

60

VGG 5152n

a

169

02

na14

42

110

7

413

7

0

50

100

150

AlexNet 2

159

1756

3

321

8

107

5

103

5

127

6

0

20

40

60

80

AlexNet 3

198

3627

8

437

7

144

1

140

5

159

4

0

20

40

60

AlexNet 4

151

0

422

2

456

4

102

6

96711

36

0

10

20

30

40

50AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 96 GBs CMR of 22

Tim

e (m

illis

econ

ds)

na

105

95n

a30

12

991

796

0

30

60

90

AlexNet 2

106

1369

827

33

916

788

733

0

10

20

30

40

AlexNet 3

135

8392

830

15

114

210

05

995

0

10

20

30

40

AlexNet 49

3525

36

206

38

097

026

72

0

10

20

AlexNet 5

160

0113

521

128

3012

212

106

7382

31

0

50

100

150

VGG 12

596

857

36

524

048

08

440

336

37

0

20

40

60

VGG 21

840

611

492

834

968

35

616

157

21

0

25

50

75

100

125VGG 22

323

954

70

319

728

64

283

327

97

0

20

40

60VGG 31

570

010

808

733

747

13

462

248

77

0

30

60

90

VGG 32

262

554

76

283

8222

225

85

268

6

0

20

40

60VGG 41

507

610

635

574

8417

147

60

503

1

0

30

60

90

VGG 42

197

9295

917

6714

58

148

615

59

0

10

20

30

VGG 5152

RegularminusFFT GaussminusFFT Winograd (Jia et al) MKLminusDNN Win MKLminusDNN Direct LIBXSMM Win

Tim

e (m

illis

econ

ds)

314

72

315

87

281

13

189

55

142

48

225

97

0

100

200

300

VGG 12

903

8

110

24

103

80

799

0

609

3885

9

0

30

60

90

120VGG 21

160

52209

11

152

36103

31

927

6

116

63

0

50

100

150

200

VGG 22

539

4

135

21

777

9468

0

437

8

469

5

0

50

100

VGG 31

931

0

240

95

180

10

735

7

749

2

727

9

0

50

100

150

200

250

VGG 3243

56

139

03

711

1414

3

413

2

340

0

0

50

100

150VGG 41

809

6

207

68

133

21736

4

775

4

635

6

0

50

100

150

200

VGG 42

316

3

714

6

284

3

230

9

440

7

223

0

0

20

40

60

VGG 5152

na

153

37

na19

59

137

8533

5

0

50

100

150

AlexNet 2

168

8710

7

382

0

130

1

116

3

142

7

0

20

40

60

AlexNet 3

227

9102

89

628

9

175

8

159

0

177

6

0

30

60

90

AlexNet 4

181

1653

0

370

8

125

1

109

5

126

2

0

20

40

60

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 683 GBs CMR of 31

Tim

e (m

illis

econ

ds)

Figure 6 Running time of our implementations compared to statendashofndashthendashart on AVX512 capable systems

16

283

83

155

28216

12124

36

906

9

146

12

0

100

200

300

VGG 12

102

92

694

5

829

5

519

2383

1

568

7

0

30

60

90

VGG 21

178

06

138

54

104

85683

1

528

2

745

0

0

50

100

150

VGG 22

102

36

442

4

372

2

293

4

249

6

291

5

0

30

60

90

VGG 31

121

72

875

5

675

7446

6

445

5

419

1

0

50

100

VGG 32

933

3

420

0

272

2

248

6

241

8

193

8

0

25

50

75

100VGG 41

145

85

833

4509

5

452

8

444

4

354

0

0

50

100

150

VGG 42

647

3

232

8

185

5

149

1

140

7

124

4

0

20

40

60

VGG 5152

na19

95

na

122

4

852

359

8

0

10

20

30

AlexNet 2

226

6

116

1

130

2873

717815

0

5

10

15

20

AlexNet 3

395

5

162

2

193

6

116

3

106

5

106

7

0

10

20

30

40

AlexNet 4

246

2

108

7

129

0779

725

729

0

10

20

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 (using 48 cores) 1024 GBs CMR of 33T

ime

(mill

isec

onds

)

297

93

146

66

206

10126

13

944

4

146

57

0

100

200

300

VGG 12

116

19

638

2

718

8526

0

393

8

575

9

0

25

50

75

100

125VGG 21

215

60

129

62

107

95711

7

524

6

770

6

0

50

100

150

200

VGG 22

105

16

559

3391

3

308

3

240

7

294

5

0

30

60

90

VGG 31

106

90

117

46

976

6

438

2

425

6

398

1

0

40

80

120

VGG 32

104

13

334

4

435

3

226

1

209

6

174

0

0

30

60

90

VGG 41

142

18

667

2

740

1415

0

388

2

308

6

0

50

100

150

VGG 42

798

9

186

1

167

3

139

2

124

0

109

3

0

20

40

60

80

VGG 5152n

a

156

1

na

122

7

865

363

6

0

10

20

30

40AlexNet 2

287

0

95214

12

852

661800

0

10

20

30

AlexNet 3

333

1

130

4

229

1

106

1

957

927

0

10

20

30

AlexNet 4

299

9

87612

33

756

670

689

0

10

20

30

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 1024 GBs CMR of 3911

Tim

e (m

illis

econ

ds)

403

59

361

04

291

20

254

04182

45

292

40

0

100

200

300

400

VGG 12

114

93

136

57

121

77

103

81

766

8

113

60

0

50

100

150VGG 21

187

07281

99

238

27

132

57

106

41

148

55

0

100

200

300VGG 22

638

0

142

72

949

7

592

7

493

5

590

1

0

50

100

150

VGG 31

105

34

263

79

188

49

856

3

830

4

820

4

0

100

200

VGG 32

491

5

134

24

896

6

484

0

458

3

379

9

0

50

100

VGG 41

887

1

248

79

170

67865

1

849

7

695

1

0

100

200

VGG 42

343

9

713

3

444

4

261

4

263

0

250

7

0

20

40

60

VGG 5152

na

186

41

na24

94

176

1685

8

0

50

100

150

200AlexNet 2

188

0700

4

640

6

160

5

133

8

163

3

0

20

40

60

AlexNet 326

32

102

98

658

6

207

5

183

8

201

5

0

30

60

90

AlexNet 4

200

6691

4

438

8

148

7

124

2

144

3

0

20

40

60

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 512 GBs CMR of 4125

Tim

e (m

illis

econ

ds)

Figure 7 Running time of our implementations compared to statendashofndashthendashart on AVX512 capable systems continued

17

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 Winogradndash and FFTndash Based Convolutions
    • 22 Winogradndash and FFTndash Based Convolution Layers
    • 23 Gauss Multiplication of Complex Numbers
      • 3 Implementations
      • 4 Performance Comparisons
      • 5 Performance Analysis
        • 51 Relative Performances
        • 52 The Accuracy of the Theoretical Estimates
        • 53 Discussions
          • 6 Conclusions
          • References
          • A Detailed Analysis
            • A1 Input Transform Stage
            • A2 Kernel Transform Stage
            • A3 Elementndashwise Products
            • A4 Output Transform
              • B Model Lookup Tables
              • C Additional Model Verification
              • D Comparing to Statendashofndashthendashart
Page 10: FFT Convolutions are Faster than Winograd on Modern CPUs ... · Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One

Table 2: FLOPs, DM (data movement, in bytes) and AI for different stages of the Winograd (F(m², r²)), Regular–FFT (F(m², r²)) and Gauss–FFT (G(m², r²)) methods. N is the number of tiles per single image; the tiles have size t² (t = m + r - 1). The values of c and c′ depend on the amount of available cache and are obtained by solving the optimization problem given in Eqn. 13.

Stage | Winograd | Regular–FFT | Gauss–FFT

FLOPS
Input image transform | $BCN\,F_I(m^2,r^2)$ | $BCN\,F_I(m^2,r^2)$ | $BCN\,G_I(m^2,r^2)$
Kernel transform | $CC'\,F_K(m^2,r^2)$ | $CC'\,F_K(m^2,r^2)$ | $CC'\,G_K(m^2,r^2)$
Element–wise product | $2t^2BNCC'$ | $8t\lceil(t+1)/2\rceil BNCC'$ | $6t\lceil(t+1)/2\rceil BNCC'$
Output transform | $BC'N\,F_O(m^2,r^2)$ | $BC'N\,F_O(m^2,r^2)$ | $BC'N\,G_O(m^2,r^2)$

DM (bytes)
Input image transform | $4BCx^2 + 4BCNt^2$ | $4BCx^2 + 8BCNt\lceil(t+1)/2\rceil$ | $4BCx^2 + 12BCNt\lceil(t+1)/2\rceil$
Kernel transform | $4CC'(r^2 + t^2)$ | $4CC'(r^2 + 2t\lceil(t+1)/2\rceil)$ | $4CC'(r^2 + 3t\lceil(t+1)/2\rceil)$
Element–wise product | $4t^2BN(c + \alpha c')\,CC'/(cc')$ | $8t\lceil(t+1)/2\rceil BN(c + \alpha c')\,CC'/(cc')$ | $12t\lceil(t+1)/2\rceil BN(c + \alpha c')\,CC'/(cc')$
Output transform | $4BC'N(t^2 + m^2)$ | $4BC'N(2t\lceil(t+1)/2\rceil + m^2)$ | $4BC'N(3t\lceil(t+1)/2\rceil + m^2)$

AI
Input image transform | $\dfrac{N F_I(m^2,r^2)}{4x^2 + 4Nt^2}$ | $\dfrac{N F_I(m^2,r^2)}{4x^2 + 8Nt\lceil(t+1)/2\rceil}$ | $\dfrac{N G_I(m^2,r^2)}{4x^2 + 12Nt\lceil(t+1)/2\rceil}$
Kernel transform | $\dfrac{F_K(m^2,r^2)}{4r^2 + 4t^2}$ | $\dfrac{F_K(m^2,r^2)}{4r^2 + 8t\lceil(t+1)/2\rceil}$ | $\dfrac{G_K(m^2,r^2)}{4r^2 + 12t\lceil(t+1)/2\rceil}$
Element–wise product | $\dfrac{cc'}{2(c + \alpha c')}$ | $\dfrac{cc'}{c + \alpha c'}$ | $\dfrac{cc'}{2(c + \alpha c')}$
Output transform | $\dfrac{F_O(m^2,r^2)}{4t^2 + 4m^2}$ | $\dfrac{F_O(m^2,r^2)}{8t\lceil(t+1)/2\rceil + 4m^2}$ | $\dfrac{G_O(m^2,r^2)}{12t\lceil(t+1)/2\rceil + 4m^2}$
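The element–wise rows of Tbl. 2 need no per–tile lookup and can be evaluated directly. Below is a minimal Python sketch of that evaluation; the function name and argument layout are our own illustration, not the paper's implementation.

```python
from math import ceil

def elementwise_flops(B, C, Cp, N, t, method):
    """FLOPs of the element-wise stage for one layer, per Tbl. 2.

    B: batch size, C/Cp: input/output channels, N: tiles per image,
    t: transformed tile size (t = m + r - 1)."""
    half = t * ceil((t + 1) / 2)            # conjugate-symmetric locations
    if method == "winograd":                # t^2 real matrix products
        return 2 * t * t * B * N * C * Cp
    if method == "regular_fft":             # one complex MAC = 4 real MACs
        return 8 * half * B * N * C * Cp
    if method == "gauss_fft":               # Gauss: 3 real products per complex one
        return 6 * half * B * N * C * Cp
    raise ValueError(method)
```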

All three methods need to fetch all kernel data from memory, reading a total of $4CC'r^2$ bytes, and to store the $CC'$ transformed kernels back to main memory. The transformed kernels have size $t \times t$ and are stored in the same fashion as the transformed input tiles, requiring a total of $4t^2$, $8t\lceil(t+1)/2\rceil$ and $12t\lceil(t+1)/2\rceil$ bytes per kernel for the Winograd, Regular–FFT and Gauss–FFT methods, respectively.

The total number of operations, data movements and AIs for transforming all kernels of a particular layer using any of the three methods are given in Tbl. 2.

A.3 Element–wise Products

Having all transformed input and kernel tiles, the pre–transformed output tiles $I'_{b,c',n}$ are computed via

$$I'_{b,c',n} = \sum_{c=1}^{C} I_{b,c,n} \odot W_{c,c'} \tag{11}$$

Here, all of the pre–transformed output tiles ($I'_{b,c',n}$), transformed input tiles $I_{b,c,n}$ and transformed kernels $W_{c,c'}$ have the same size $t \times t$. Computing an element of the pre–transformed output tile at a location $\ell$ depends only on the elements of the transformed input tiles and transformed kernels at the same location, and is computed via

$$I'_{b,c',n}(\ell) = \sum_{c=1}^{C} I_{b,c,n}(\ell) \cdot W_{c,c'}(\ell) \tag{12}$$

Note that the equation above can be interpreted as a multiplication of a matrix $U_\ell$ of size $BN \times C$ with a matrix $V_\ell$ of size $C \times C'$, resulting in a matrix $X_\ell$ of size $BN \times C'$. For the layers of modern ConvNets, the values of $BN$ are much larger than $C$, which results in multiplications of tall and skinny matrices [18, 21]. Such matrix multiplications have to be performed for each element location of $I'_{b,c',n}$.
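A minimal NumPy sketch of this interpretation follows. It assumes the transformed tiles are stored as dense arrays with one $BN \times C$ (input) and one $C \times C'$ (kernel) matrix per element location; the names, sizes and layout are our own illustration, not the paper's data layout.

```python
import numpy as np

def elementwise_products(U, V):
    """One independent tall-and-skinny matrix product per element location.

    U has shape (t*t, B*N, C), V has shape (t*t, C, Cp);
    the result X[l] = U[l] @ V[l] has shape (B*N, Cp)."""
    return np.einsum('lbc,lcd->lbd', U, V)

# Hypothetical sizes, for illustration only.
t, B, N, C, Cp = 4, 2, 9, 32, 64
U = np.random.rand(t * t, B * N, C)
V = np.random.rand(t * t, C, Cp)
X = elementwise_products(U, V)   # shape (t*t, B*N, Cp)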

A.3.1 Operations Required

In the Winograd method, $t^2$ real matrices are multiplied, requiring a total of $2t^2BNCC'$ operations for the whole layer, where $N$ is the number of tiles per image.

For the FFT–based methods, complex matrices are multiplied. However, only $t\lceil(t+1)/2\rceil$ locations of the pre–transformed output tiles need to be computed: as both the transformed input tiles and the transformed kernels are conjugate symmetric, the pre–transformed output tiles will also be conjugate symmetric.

In the Regular–FFT method, 4 real multiply–adds are required to perform a complex multiply–add. This gives a total of $8t\lceil(t+1)/2\rceil BNCC'$ FLOPs for the whole layer. Gauss–FFT reduces each complex matrix multiplication to 3 real matrix multiplications of matrices of the same size, totaling $6t\lceil(t+1)/2\rceil BNCC'$ FLOPs for the whole layer.
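The 3–multiplication complex product used here is Gauss's multiplication of complex numbers. A small self–contained sketch (our own illustration), with a check against the direct 4–multiplication formula:

```python
def gauss_complex_mul(ar, ai, br, bi):
    """(ar + i*ai) * (br + i*bi) using 3 real multiplications (Gauss's trick)."""
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    return k1 - k3, k1 + k2            # real and imaginary parts

# Quick check against the usual 4-multiplication formula.
re, im = gauss_complex_mul(3.0, 2.0, 1.0, 7.0)
assert (re, im) == (3.0 * 1.0 - 2.0 * 7.0, 3.0 * 7.0 + 2.0 * 1.0)
```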

A.3.2 Data Movement

In the Winograd method, $t^2$ real matrix multiplications $X = U \times V$ are performed (for the rest of the section we omit the subscript $\ell$ for clarity). In the Regular–FFT method, $t\lceil(t+1)/2\rceil$ complex matrix multiplications of the same sizes are performed. The Gauss–FFT method replaces each complex matrix multiplication with three real matrix multiplications; thus a total of $3t\lceil(t+1)/2\rceil$ real matrix multiplications are performed.

For modern ConvNets, the matrices $X$, $U$ and $V$ may not fit in cache, in which case standard cache–blocking approaches are employed [14–16, 18, 22], which might require some data to be moved to and/or from main memory multiple times.

In the standard cache–blocking technique, $V$ (of size $C \times C'$) is subdivided into equally sized sub–matrices of size $c \times c'$. To minimize transfers to and from main memory, a sub–matrix of $V$ is kept in cache. A small sub–matrix of $U$, of size $\rho \times c$, is fetched from memory, multiplied with the sub–matrix of $V$ stored in cache, and accumulated into the appropriate sub–matrix of $X$. Here $\rho$ is a small number required for efficient in–cache computation [14–16, 18, 22].
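The blocking scheme just described can be written down directly. The following plain, unoptimized Python sketch only illustrates the loop structure; the block sizes and the panel height $\rho$ are parameters, not values taken from the paper.

```python
import numpy as np

def blocked_matmul(U, V, c, cp, rho=8):
    """Cache-blocked X = U @ V: a c x c' block of V is (conceptually) kept in
    cache while rho x c panels of U are streamed and accumulated into X."""
    BN, C = U.shape
    _, Cp = V.shape
    assert C % c == 0 and Cp % cp == 0
    X = np.zeros((BN, Cp))
    for j in range(0, Cp, cp):                # block of columns of V and X
        for k in range(0, C, c):              # block of rows of V (in cache)
            V_blk = V[k:k + c, j:j + cp]
            for i in range(0, BN, rho):       # stream small panels of U
                X[i:i + rho, j:j + cp] += U[i:i + rho, k:k + c] @ V_blk
    return X
```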

Each such sub–matrix multiplication requires transferring a total of $\rho c$ numbers of $U$ from main memory, and $\rho c'$ numbers of $X$ from, and then back to, main memory; a total of $\rho(c + 2c')$ numbers. In the special case when $c = C$, only $\rho(c + c')$ numbers need to be moved, as each sub–matrix multiplication produces a final result (no accumulation is necessary).

A total of $BNCC'/(\rho cc')$ such sub–matrix multiplications are performed, requiring $BN(c + \alpha c')\,CC'/(cc')$ numbers to be moved, with $\alpha$ being 1 when $c = C$ and 2 when $c < C$.

In the Winograd method, $t^2$ real matrix multiplications ($X = U \times V$) are performed, thus a total of $4t^2BN(c + \alpha c')\,CC'/(cc')$ bytes are transferred.

In the Regular–FFT method, $t\lceil(t+1)/2\rceil$ complex matrix multiplications are performed, requiring a total of $8t\lceil(t+1)/2\rceil BN(c + \alpha c')\,CC'/(cc')$ bytes to be moved.

The Gauss–FFT method replaces each of the $t\lceil(t+1)/2\rceil$ complex matrix multiplications with 3 real matrix multiplications; a total of $12t\lceil(t+1)/2\rceil BN(c + \alpha c')\,CC'/(cc')$ bytes need to be transferred.

The total number of operations is fixed. To minimize the amount of data movement, we need to choose optimal values for $c$ and $c'$ while still allowing for in–cache computation.

As the values of $t$, $B$, $C$, $C'$ and $N$ are constant, the optimal values of $c$ and $c'$ can be chosen by solving the following optimization problem:

$$
\begin{aligned}
\underset{c,\;c'}{\text{minimize}}\quad & \frac{c + \alpha c'}{c c'} \\
\text{subject to}\quad & c \mid C && (C \text{ is divisible by } c)\\
& c' \mid C' && (C' \text{ is divisible by } c')\\
& 4\beta c c' \le \frac{\text{Cache Size}}{2} && (\text{fits in half of the cache})
\end{aligned}
\tag{13}
$$

where $\alpha$ equals 1 when $c = C$ and 2 when $c < C$, and $\beta$ is 1 for real–valued matrices and 2 for complex–valued ones. Half of the cache is allowed for the sub–matrices of $V$; this is typical practice, ensuring enough space for the sub–matrices of $U$ and $X$.

The arithmetic intensities can then be computed by dividing the number of required operations by the amount of required data movement for the optimal values of $c$ and $c'$. This results in AIs of $cc'/(2(c + \alpha c'))$ for the Winograd and Gauss–FFT methods, and $cc'/(c + \alpha c')$ for the Regular–FFT method, which are equal to the AIs of real–matrix and complex–matrix multiplications, respectively [14–16, 18].
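Eqn. 13 has a small search space (divisors of $C$ and $C'$), so it can be solved by simple enumeration. The sketch below (our own illustration) does that and also reports the resulting AI of the element–wise stage; the cache size and channel counts in the example call are assumed values, not measurements from the paper.

```python
def choose_blocking(C, Cp, cache_bytes, complex_matrices=False):
    """Brute-force solution of Eqn. 13, returning (c, c', resulting AI)."""
    beta = 2 if complex_matrices else 1            # complex numbers are twice as big

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    best = None
    for c in divisors(C):
        for cp in divisors(Cp):
            if 4 * beta * c * cp > cache_bytes / 2:    # V block must fit in half cache
                continue
            alpha = 1 if c == C else 2
            cost = (c + alpha * cp) / (c * cp)         # objective of Eqn. 13
            if best is None or cost < best[0]:
                best = (cost, c, cp, alpha)
    _, c, cp, alpha = best
    ai = c * cp / (c + alpha * cp)                     # AI of a complex matrix product
    if not complex_matrices:
        ai /= 2                                        # real matrices: cc'/(2(c + alpha*c'))
    return c, cp, ai

# Example: C = C' = 512 with a 1 MiB cache (assumed values).
print(choose_blocking(512, 512, cache_bytes=1 << 20, complex_matrices=True))
```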

In Fig. 4 we show the arithmetic intensities for layers with different numbers of channels, as a function of the cache size.

[Figure 4: Arithmetic intensities for the element–wise stage given various amounts of cache. In the Winograd and Gauss–FFT methods real matrix multiplications are performed, whereas in the Regular–FFT method complex matrix multiplications are performed. Two panels (real and complex matrices); x–axis: cache size (KB); y–axis: arithmetic intensity; one curve per channel count C = C′ in {64, 128, 256, 512, 1024}.]

The AIs of both complex (used in the Regular–FFT method) and real matrix multiplication (used in Winograd and Gauss–FFT) increase with the cache size and with the number of channels ($C$ and $C'$). For a fixed cache size, complex matrix multiplication allows for a higher arithmetic intensity. This indicates that the element–wise stage of the Regular–FFT method may achieve better hardware utilization than the ones of the Winograd or Gauss–FFT convolutions when the cache size is a limiting resource.

A.4 Output Transform

In the output transform stage, each of the $BNC'$ pre–transformed output tiles $I'_{b,c',n}$ is transformed from the Winograd/FFT domain back to the spatial domain via $\hat{I}_{b,c',n} = A^T I'_{b,c',n} A$. The matrix $A$ has size $t \times m$ ($t = m + r - 1$), resulting in $\hat{I}_{b,c',n}$ having a size of $m \times m$. Here $\hat{I}_{b,c',n}$ is the $n$–th (non–overlapping) tile of the $c'$–th image in batch $b$ of the final result of the convolutional layer.

Performing the inverse transforms is very similar to performing the input transforms, except that 1) only a subset of $m \times m$ elements needs to be computed, and 2) different (inverse transform) matrices (or inverse FFT transforms) are used.

Given the analysis of the input and kernel transforms, analyzing the output transforms is straightforward. The total number of operations, data movements and AIs for transforming the output tiles of a particular layer using any of the three methods are given in Tbl. 2.
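For concreteness, a single output–tile transform is just two small matrix products. The sketch below uses the standard $F(2 \times 2, 3 \times 3)$ inverse–transform matrix ($t = 4$, $m = 2$) as an assumed example; it is not a matrix taken from the paper.

```python
import numpy as np

# Standard A for F(2x2, 3x3): t = 4, m = 2 (example values, not from the paper).
A = np.array([[1.0,  0.0],
              [1.0,  1.0],
              [1.0, -1.0],
              [0.0, -1.0]])

def output_transform(Z):
    """Map one t x t pre-transformed output tile back to an m x m spatial tile."""
    return A.T @ Z @ A

Z = np.random.rand(4, 4)          # pre-transformed output tile
print(output_transform(Z).shape)  # (2, 2)
```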

B Model Lookup Tables

In Tbl. 3 and Tbl. 4 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Winograd method.

In Tbl. 5 and Tbl. 6 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Regular–FFT method.

In Tbl. 7 and Tbl. 8 we show the lookup tables used in our model to determine the number of required operations and the arithmetic intensities for transforming a single input/output tile or a single kernel for the Gauss–FFT method.
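A minimal sketch of how the model consults these lookup tables; the dictionary layout is our own illustration, while the numeric values are copied from the Winograd entries of Tbl. 3 and Tbl. 4 below.

```python
# Per-tile transform costs, keyed by (m, r): (input, kernel, output).
WINOGRAD_FLOPS = {(2, 3): (32, 49, 42), (4, 3): (240, 135, 150)}
WINOGRAD_AI    = {(2, 3): (0.25, 0.49, 0.53), (4, 3): (0.83, 0.75, 0.72)}

def transform_cost(m, r, stage):
    """Return (FLOPs, AI) of one tile/kernel transform for Winograd F(m^2, r^2)."""
    idx = {"input": 0, "kernel": 1, "output": 2}[stage]
    return WINOGRAD_FLOPS[(m, r)][idx], WINOGRAD_AI[(m, r)][idx]

print(transform_cost(4, 3, "input"))   # (240, 0.83)
```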

Table 3: FLOPs required for performing Winograd transforms of a single tile or kernel.

Method | r=2 (In Ker Out) | r=3 (In Ker Out) | r=4 (In Ker Out) | r=5 (In Ker Out) | r=6 (In Ker Out) | r=7 (In Ker Out)
F(2², r²) | 12 5 5 | 32 49 42 | 180 198 154 | 240 231 168 | 742 728 504 | 704 645 430
F(3², r²) | 32 24 28 | 180 128 128 | 240 170 153 | 742 552 460 | 704 518 407 | - - -
F(4², r²) | 180 70 90 | 240 135 150 | 742 396 396 | 704 403 372 | - - - | - - -
F(5², r²) | 240 72 99 | 742 260 312 | 704 300 325 | - - - | - - - | - - -
F(6², r²) | 742 144 208 | 704 242 308 | - - - | - - - | - - - | - - -
F(7², r²) | 704 130 195 | - - - | - - - | - - - | - - - | - - -

Table 4: AIs of Winograd transforms for a single tile or kernel.

Method | r=2 (In Ker Out) | r=3 (In Ker Out) | r=4 (In Ker Out) | r=5 (In Ker Out) | r=6 (In Ker Out) | r=7 (In Ker Out)
F(2², r²) | 0.17 0.10 0.10 | 0.25 0.49 0.53 | 0.90 1.21 1.33 | 0.83 0.95 1.05 | 1.89 2.14 2.38 | 1.38 1.43 1.58
F(3², r²) | 0.25 0.30 0.28 | 0.90 0.94 0.94 | 0.83 0.82 0.85 | 1.89 1.86 1.98 | 1.38 1.29 1.39 | - - -
F(4², r²) | 0.90 0.60 0.55 | 0.83 0.75 0.72 | 1.89 1.52 1.52 | 1.38 1.13 1.16 | - - - | - - -
F(5², r²) | 0.83 0.45 0.41 | 1.89 1.12 1.05 | 1.38 0.94 0.91 | - - - | - - - | - - -
F(6², r²) | 1.89 0.68 0.61 | 1.38 0.83 0.77 | - - - | - - - | - - - | - - -
F(7², r²) | 1.38 0.48 0.43 | - - - | - - - | - - - | - - - | - - -

Table 5: FLOPs required for performing Regular–FFT transforms of a single tile or kernel.

Method | r=2 (In Ker Out) | r=3 (In Ker Out) | r=4 (In Ker Out) | r=5 (In Ker Out) | r=6 (In Ker Out) | r=7 (In Ker Out)

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -

Table 6: AIs of Regular–FFT transforms for a single tile or kernel.

Method | r=2 (In Ker Out) | r=3 (In Ker Out) | r=4 (In Ker Out) | r=5 (In Ker Out) | r=6 (In Ker Out) | r=7 (In Ker Out)

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -


C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss–FFT vs. the Regular–FFT method.

D Comparing to State–of–the–art

In Fig. 6 and Fig. 7 we show the absolute time required for processing the unique layers of VGG and AlexNet with our implementations and with other state–of–the–art libraries. The benchmarks were performed on AVX512–capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.

Table 7: FLOPs required for performing Gauss–FFT transforms of a single tile or kernel.

Method | r=2 (In Ker Out) | r=3 (In Ker Out) | r=4 (In Ker Out) | r=5 (In Ker Out) | r=6 (In Ker Out) | r=7 (In Ker Out)

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

[Figure 5: Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs. the Gauss–FFT method, for the unique layers of VGG and AlexNet. One panel per layer (VGG 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2, 5.1/5.2 and AlexNet 2–5); y–axis: speedup.]

Table 8: AIs of Gauss–FFT transforms for a single tile or kernel.

Method | r=2 (In Ker Out) | r=3 (In Ker Out) | r=4 (In Ker Out) | r=5 (In Ker Out) | r=6 (In Ker Out) | r=7 (In Ker Out)

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

[Figure 6: Running time of our implementations compared to state–of–the–art on AVX512 capable systems. One panel per unique VGG/AlexNet layer; y–axis: time (milliseconds); legend: Our Winograd, Our Regular–FFT, Our Gauss–FFT, MKL-DNN Winograd, MKL-DNN Direct, LIBXSMM Winograd. Systems: Xeon Phi 7210 (409.6 GB/s, CMR of 11), i9-7900X (96 GB/s, CMR of 22) and i9-7900X (68.3 GB/s, CMR of 31).]

[Figure 7: Running time of our implementations compared to state–of–the–art on AVX512 capable systems, continued. Same layout and legend as Fig. 6. Systems: Xeon Phi 7210 using 48 cores (102.4 GB/s, CMR of 33), Xeon Phi 7210 (102.4 GB/s, CMR of 39.11) and i9-7900X (51.2 GB/s, CMR of 41.25).]


are employed [14ndash16 18 22] and might require some data tobe moved to andor from the main memory multiple times

In the standard cachendashblocking techniqueV (of sizeCtimesC prime)is subdivided into equally sized matrices of size c times c primeTo minimize transfers to and from main memory a subndash

matrix of V is kept in cache A small subndashmatrix of U withsize of ρtimesc is fetched from memory multiplied with thesubndashmatrix of V stored inndashcache and accumulated to theappropriate subndashmatrix of X Here ρ is a small number re-quired for efficient inndashcache computation [14ndash16 18 22]

This requires transferring a total of ρc numbers frommainmemory and ρc prime numbers bytes of X from and then backto the main memory A total of ρ(c + 2c prime) numbers In thespecial case when c = C only ρ(c + c prime) numbers need to bemoved as each subndashmatrix multiplication produces the finalresult (no accumulation is necessary)Total of BNCC primeρcc prime such subndashmatrix multiplications are

performed and require BNCC primecc prime(c + αc prime) numbers to bemoved With α being 1 when c = C and 2 when c lt C In the Winograd method t2 real matrix multiplications

(X = U times V ) are performed thus transferring a total of4t2BN (c + αc prime)CC primecc prime bytes to be moved

In the RegularndashFFT method t lceil(t + 1)2rceil complex matrixmultiplications are performed requiring a total of 8t lceil(t +1)2rceilBN (c + αc prime)CC primecc prime bytes to be moved

The GaussndashFFT replaces each of the t lceil(t + 1)2rceil complexmatrix multiplications with 3 real matrix multiplicationsTotal of 12t lceil(t + 1)2rceilBN (c + αc prime)CC primecc prime bytes need to betransferredThe total number of operations is fixed To minimize the

amount of data movements we need to choose optimal valuesfor c and c prime while allowing for inndashcache computation

As the values of t BC C prime and N are constant the optimalvalues of c and c prime can be chosen by solving the followingoptimization problem

minimizeccprime

(c + αc prime)cc prime

subject to c | C (C is divisible by c)cprime | C prime (C prime is divisible by cprime)

4βccprime le Cache Size2

(fits in half cache)

(13)

Where α equals 1 when c = C and 2 when c lt C β is 1 forreal valued matrices and 2 for complex valued ones Halfthe cache is allowed for subndashmatrices of V This is typicalpractice to ensure enough space for subndashmatrices of U andX

The arithmetic intensities can be then computed by di-viding the number of required operations with the amountof required data movements for optimal values of c and c primeThis results in the AIs of cc prime2(c + αc prime) for the Winogradand GaussndashFFT methods and cc prime(c + αc prime) for the RegularndashFFT method which are equal to the AIs of real matrix andcomplex matrix multiplications respectively [14ndash16 18]

In Fig 4 we show the arithmetic intensities for layers withdifferent numbers of channels as a function of cache size The

20

40

60

0 250 500 750 1000

Real matrices

30

50

70

0 250 500 750 1000

Complex matrices

C amp C

64

128

256

512

1024

Cache size (KB)

Arit

hmet

ic In

tens

ity

Figure 4 Arithmetic intensities for the elementndashwise stage givenvarious amounts of cache In the Winograd and GaussndashFFT methodreal matrix multiplications are performed whereas in the RegularndashFFT complex matrix multiplications are performed

AIs of both complex (used in RegularndashFFT method) and realmatrix multiplication (used in Winograd and GaussndashFFT)increase with the cache size and the number of channels (Cand Crsquo) For a fixed cache size the complex matrix multipli-cation allows for a higher arithmetic intensity This indicatesthat the elementndashwise stage of the RegularndashFFT method mayachieve better hardware utilization than the ones of Wino-grad or GaussndashFFT convolution when the cache size is alimiting resourceA4 Output Transform

In the output transform stage each of the BNC prime prendashtransformed output tiles I primebc primen is transformed from theWinogradFFT domain back to the spatial domain via I primebcn =AT I primebcnA The matrix A has size of t timesm (t = m + r minus 1)resulting in I primebc primen having a size ofm timesm Here I primebc primen is thenndashth (nonndashoverlapping) tile of the c prime-th image in batch b ofthe final result of the convolutional layerPerforming the inverse transforms is very similar to per-

forming the input transforms except that 1) Only a subsetof mtimesm elements need to be computed and 2) Different(inverse transform) matrices (or inverse FFT transforms) areusedGiven the analysis of the input and kernel transforms

analyzing the output transforms is straightndashforward Thetotal number of operations data movements and AIs fortransforming output tiles of a particular layer using any ofthe three methods are given in Tbl 2

B Model Lookup TablesIn Tbl 3 and Tbl 4 we show the lookup tables used in ourmodel to determine the number of required operations andthe arithmetic intensities for transforming a single inputout-put tile or a single kernel for the Winograd method

In Tbl 5 and Tbl 6 we show the lookup tables used in ourmodel to determine the number of required operations andthe arithmetic intensities for transforming a single inputout-put tile or a single kernel for the RegularndashFFT methodIn Tbl 7 and Tbl 8 we show the lookup tables used in

our model to determine the number of required operations11

Table 3 FLOPS required for performing Winograd transforms of a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 12 5 5 32 49 42 180 198 154 240 231 168 742 728 504 704 645 430F(32 r2) 32 24 28 180 128 128 240 170 153 742 552 460 704 518 407 - - -F(42 r2) 180 70 90 240 135 150 742 396 396 704 403 372 - - - - - -F(52 r2) 240 72 99 742 260 312 704 300 325 - - - - - - - - -F(62 r2) 742 144 208 704 242 308 - - - - - - - - - - - -F(72 r2) 704 130 195 - - - - - - - - - - - - - - -

Table 4 AIs of Winograd transforms for a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 017 010 010 025 049 053 090 121 133 083 095 105 189 214 238 138 143 158F(32 r2) 025 030 028 090 094 094 083 082 085 189 186 198 138 129 139 - - -F(42 r2) 090 060 055 083 075 072 189 152 152 138 113 116 - - - - - -F(52 r2) 083 045 041 189 112 105 138 094 091 - - - - - - - - -F(62 r2) 189 068 061 138 083 077 - - - - - - - - - - - -F(72 r2) 138 048 043 - - - - - - - - - - - - - - -

Table 5: FLOPs required for performing Regular–FFT transforms of a single tile or kernel

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)
F(2², r²) | 54 36 42 | 72 48 48 | 245 206 152 | 300 256 160 | 702 632 288 | 492 434 191
F(3², r²) | 72 28 66 | 245 171 186 | 300 192 197 | 702 566 444 | 492 380 263 | 1292 1088 678
F(4², r²) | 245 92 224 | 300 158 232 | 702 504 504 | 492 320 335 | 1292 992 758 | 1200 904 768
F(5², r²) | 300 128 274 | 702 325 568 | 492 276 400 | 1292 805 850 | 1200 792 832 | 2710 2118 1820
F(6², r²) | 702 172 636 | 492 206 453 | 1292 624 930 | 1200 722 912 | 2710 1980 1956 | 1416 924 1074
F(7², r²) | 492 146 503 | 1292 491 1014 | 1200 656 996 | 2710 1523 2096 | 1416 896 1141 | 3681 2583 2646
F(8², r²) | 1292 296 1102 | 1200 564 1084 | 2710 1108 2240 | 1416 823 1190 | 3681 2432 2808 | 3228 2082 2384
F(9², r²) | 1200 340 1176 | 2710 735 2388 | 1416 672 1262 | 3681 2304 3066 | 3228 1960 2526 | 3405 2254 2499
F(10², r²) | 2710 404 2540 | 1416 610 1336 | 3681 2052 3240 | 3228 1842 2672 | 3405 2082 2616 | 2904 1688 2292
F(11², r²) | 1416 364 1412 | 3681 1663 3418 | 3228 1504 2822 | 3405 1865 2897 | 2904 1530 2428 | 7793 5153 6354
F(12², r²) | 3681 724 3600 | 3228 1162 2976 | 3405 1628 3052 | 2904 1374 2562 | 7793 4716 6630 | 6068 3479 4228
F(13², r²) | 3228 648 3134 | 3405 1324 3211 | 2904 1246 2707 | 7793 4341 6892 | 6068 3290 4434 | 14418 8629 11023
F(14², r²) | 3405 764 3374 | 2904 936 2824 | 7793 4058 7167 | 6068 2660 4648 | 14418 8176 11410 | 5440 2876 4512
F(15², r²) | 2904 634 2976 | 7793 3231 7446 | 6068 1944 4840 | 14418 7305 11775 | 5440 2580 4682 | 8658 5617 6841
F(16², r²) | 7793 1616 7720 | 6068 1540 5036 | 14418 6668 12288 | 5440 2409 4878 | 8658 4880 7470 | 11852 5742 9976
F(17², r²) | 6068 1132 5236 | 14418 6266 12699 | 5440 2204 5078 | 8658 4515 7733 | 11852 4920 10296 | 26954 15562 23206
F(18², r²) | 14418 4516 13082 | 5440 1919 5282 | 8658 3630 8000 | 11852 4268 10620 | 26954 15072 23808 | 7524 3580 6622
F(19², r²) | 5440 1196 5490 | 8658 2557 8271 | 11852 3432 10948 | 26954 11815 24414 | 7524 3440 6837 | 17260 7145 12714
F(20², r²) | 8658 1448 8546 | 11852 2718 11280 | 26954 10880 25024 | 7524 3104 6988 | 17260 6432 13066 | 16128 8218 14180
F(21², r²) | 11852 1552 11616 | 26954 9981 25638 | 7524 2612 7213 | 17260 5775 13422 | 16128 7944 14560 | 21050 8309 15183
F(22², r²) | 26954 8396 26624 | 7524 2176 7363 | 17260 4746 13782 | 16128 7146 14944 | 21050 7586 15566 | 14184 6929 12426
F(23², r²) | 7524 1442 7570 | 17260 3773 14146 | 16128 6544 15332 | 21050 6308 15953 | 14184 6552 12760 | 34745 18690 31062
F(24², r²) | 17260 2076 14514 | 16128 6216 15724 | 21050 5100 16344 | 14184 6205 13098 | 34745 17994 31704 | 15480 7333 13912
F(25², r²) | 16128 2596 16120 | 21050 4118 16739 | 14184 5624 13440 | 34745 17385 32350 | 15480 6704 14284 | 38400 19470 34934
F(26², r²) | 21050 2516 17138 | 14184 4089 13786 | 34745 15508 33000 | 15480 6218 14610 | 38400 19142 35600 | 15804 7224 14352
F(27², r²) | 14184 2356 14136 | 34745 14655 33654 | 15480 4856 14933 | 38400 18275 36270 | 15804 6574 14669 | - - -
F(28², r²) | 34745 11276 34544 | 15480 3842 15292 | 38400 17596 36944 | 15804 5924 14962 | - - - | - - -
F(29², r²) | 15480 2840 15655 | 38400 14293 37622 | 15804 5302 15286 | - - - | - - - | - - -
F(30², r²) | 38400 10940 38394 | 15804 3856 15648 | - - - | - - - | - - - | - - -
F(31², r²) | 15804 2570 16014 | - - - | - - - | - - - | - - - | - - -


Table 6: AIs of Regular–FFT transforms for a single tile or kernel

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)
F(2², r²) | 0.64 0.56 0.66 | 0.45 0.36 0.43 | 1.11 1.12 1.12 | 0.89 0.88 0.77 | 1.67 1.72 1.20 | 0.85 0.84 0.57
F(3², r²) | 0.45 0.25 0.50 | 1.11 1.10 1.19 | 0.89 0.75 0.86 | 1.67 1.75 1.71 | 0.85 0.82 0.74 | 1.89 1.96 1.71
F(4², r²) | 1.11 0.68 1.22 | 0.89 0.69 0.91 | 1.67 1.75 1.75 | 0.85 0.76 0.87 | 1.89 1.97 1.79 | 1.36 1.34 1.41
F(5², r²) | 0.89 0.62 0.94 | 1.67 1.25 1.75 | 0.85 0.72 0.95 | 1.89 1.75 1.85 | 1.36 1.27 1.43 | 2.68 2.93 2.90
F(6², r²) | 1.67 0.72 1.73 | 0.85 0.58 0.98 | 1.89 1.47 1.85 | 1.36 1.24 1.46 | 2.68 2.95 2.91 | 1.13 1.06 1.32
F(7², r²) | 0.85 0.43 0.97 | 1.89 1.24 1.82 | 1.36 1.21 1.47 | 2.68 2.43 2.90 | 1.13 1.10 1.31 | 2.62 2.80 2.86
F(8², r²) | 1.89 0.79 1.79 | 1.36 1.09 1.47 | 2.68 1.87 2.86 | 1.13 1.07 1.28 | 2.62 2.79 2.85 | 1.92 1.91 2.07
F(9², r²) | 1.36 0.69 1.46 | 2.68 1.30 2.80 | 1.13 0.91 1.27 | 2.62 2.78 2.91 | 1.92 1.88 2.07 | 1.83 1.95 1.95
F(10², r²) | 2.68 0.74 2.74 | 1.13 0.86 1.25 | 2.62 2.59 2.87 | 1.92 1.85 2.06 | 1.83 1.89 1.92 | 1.33 1.25 1.48
F(11², r²) | 1.13 0.53 1.22 | 2.62 2.18 2.82 | 1.92 1.57 2.04 | 1.83 1.76 2.01 | 1.33 1.18 1.48 | 3.27 3.63 3.72
F(12², r²) | 2.62 0.97 2.76 | 1.92 1.25 2.02 | 1.83 1.59 1.99 | 1.33 1.10 1.48 | 3.27 3.45 3.68 | 2.22 2.13 2.10
F(13², r²) | 1.92 0.71 1.99 | 1.83 1.33 1.96 | 1.33 1.02 1.48 | 3.27 3.28 3.63 | 2.22 2.08 2.10 | 4.86 5.03 5.02
F(14², r²) | 1.83 0.78 1.93 | 1.33 0.79 1.46 | 3.27 3.15 3.57 | 2.22 1.73 2.09 | 4.86 4.91 4.95 | 1.62 1.47 1.77
F(15², r²) | 1.33 0.54 1.45 | 3.27 2.56 3.51 | 2.22 1.29 2.07 | 4.86 4.51 4.87 | 1.62 1.36 1.76 | 2.40 2.75 2.49
F(16², r²) | 3.27 1.30 3.43 | 2.22 1.04 2.04 | 4.86 4.21 4.83 | 1.62 1.30 1.75 | 2.40 2.45 2.60 | 2.93 2.49 3.18
F(17², r²) | 2.22 0.78 2.02 | 4.86 4.03 4.75 | 1.62 1.21 1.74 | 2.40 2.32 2.57 | 2.93 2.18 3.15 | 6.23 6.47 6.90
F(18², r²) | 4.86 2.94 4.65 | 1.62 1.07 1.73 | 2.40 1.90 2.54 | 2.93 1.93 3.12 | 6.23 6.41 6.79 | 1.57 1.33 1.75
F(19², r²) | 1.62 0.67 1.71 | 2.40 1.36 2.51 | 2.93 1.58 3.08 | 6.23 5.12 6.69 | 1.57 1.30 1.74 | 3.38 2.56 3.14
F(20², r²) | 2.40 0.78 2.48 | 2.93 1.27 3.04 | 6.23 4.79 6.57 | 1.57 1.20 1.71 | 3.38 2.34 3.11 | 2.87 2.64 3.14
F(21², r²) | 2.93 0.73 3.00 | 6.23 4.45 6.45 | 1.57 1.02 1.69 | 3.38 2.14 3.08 | 2.87 2.60 3.11 | 3.54 2.58 3.17
F(22², r²) | 6.23 3.78 6.42 | 1.57 0.86 1.66 | 3.38 1.78 3.04 | 2.87 2.37 3.08 | 3.54 2.39 3.14 | 2.18 1.95 2.35
F(23², r²) | 1.57 0.57 1.64 | 3.38 1.43 3.00 | 2.87 2.20 3.05 | 3.54 2.02 3.10 | 2.18 1.87 2.33 | 5.08 5.08 5.55
F(24², r²) | 3.38 0.79 2.96 | 2.87 2.11 3.01 | 3.54 1.65 3.07 | 2.18 1.79 2.31 | 5.08 4.97 5.48 | 2.08 1.82 2.26
F(25², r²) | 2.87 0.89 2.98 | 3.54 1.35 3.03 | 2.18 1.64 2.29 | 5.08 4.86 5.41 | 2.08 1.68 2.25 | 4.92 4.68 5.40
F(26², r²) | 3.54 0.83 2.99 | 2.18 1.20 2.27 | 5.08 4.38 5.34 | 2.08 1.58 2.23 | 4.92 4.66 5.34 | 1.87 1.59 2.03
F(27², r²) | 2.18 0.70 2.25 | 5.08 4.17 5.26 | 2.08 1.24 2.21 | 4.92 4.49 5.27 | 1.87 1.46 2.02 | - - -
F(28², r²) | 5.08 3.23 5.22 | 2.08 0.99 2.19 | 4.92 4.36 5.20 | 1.87 1.33 2.00 | - - - | - - -
F(29², r²) | 2.08 0.74 2.17 | 4.92 3.57 5.13 | 1.87 1.20 1.98 | - - - | - - - | - - -
F(30², r²) | 4.92 2.75 5.07 | 1.87 0.88 1.97 | - - - | - - - | - - - | - - -
F(31², r²) | 1.87 0.59 1.95 | - - - | - - - | - - - | - - - | - - -


C Additional Model Verification

In Fig. 5 we show the theoretical predictions and the empirical results of the Gauss–FFT vs. the Regular–FFT methods.

D Comparing to State–of–the–art

In Fig. 6 and Fig. 7 we show the results of the absolute time required for processing unique layers of VGG and AlexNet with our implementations and other state–of–the–art libraries. The benchmarks were performed on AVX512–capable systems, as neither MKL-DNN nor LIBXSMM supports AVX2. Both MKL-DNN and LIBXSMM support only kernels of size 3 × 3 in their Winograd implementations; thus the measurements for layer 2 of AlexNet, which has kernels of size 5 × 5, are missing.


Table 7: FLOPs required for performing Gauss–FFT transforms of a single tile or kernel

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)
G(2², r²) | 63 54 60 | 88 80 80 | 270 256 202 | 336 328 232 | 751 730 386 | 556 562 319
G(3², r²) | 88 60 98 | 270 221 236 | 336 264 269 | 751 664 542 | 556 508 391 | 1373 1250 840
G(4², r²) | 270 142 274 | 336 230 304 | 751 602 602 | 556 448 463 | 1373 1154 920 | 1300 1104 968
G(5², r²) | 336 200 346 | 751 423 666 | 556 404 528 | 1373 967 1012 | 1300 992 1032 | 2831 2360 2062
G(6², r²) | 751 270 734 | 556 334 581 | 1373 786 1092 | 1300 922 1112 | 2831 2222 2198 | 1560 1212 1362
G(7², r²) | 556 274 631 | 1373 653 1176 | 1300 856 1196 | 2831 1765 2338 | 1560 1184 1429 | 3850 2921 2984
G(8², r²) | 1373 458 1264 | 1300 764 1284 | 2831 1350 2482 | 1560 1111 1478 | 3850 2770 3146 | 3424 2474 2776
G(9², r²) | 1300 540 1376 | 2831 977 2630 | 1560 960 1550 | 3850 2642 3404 | 3424 2352 2918 | 3630 2704 2949
G(10², r²) | 2831 646 2782 | 1560 898 1624 | 3850 2390 3578 | 3424 2234 3064 | 3630 2532 3066 | 3160 2200 2804
G(11², r²) | 1560 652 1700 | 3850 2001 3756 | 3424 1896 3214 | 3630 2315 3347 | 3160 2042 2940 | 8082 5731 6932
G(12², r²) | 3850 1062 3938 | 3424 1554 3368 | 3630 2078 3502 | 3160 1886 3074 | 8082 5294 7208 | 6392 4127 4876
G(13², r²) | 3424 1040 3526 | 3630 1774 3661 | 3160 1758 3219 | 8082 4919 7470 | 6392 3938 5082 | 14779 9351 11745
G(14², r²) | 3630 1214 3824 | 3160 1448 3336 | 8082 4636 7745 | 6392 3308 5296 | 14779 8898 12132 | 5840 3676 5312
G(15², r²) | 3160 1146 3488 | 8082 3809 8024 | 6392 2592 5488 | 14779 8027 12497 | 5840 3380 5482 | 9099 6499 7723
G(16², r²) | 8082 2194 8298 | 6392 2188 5684 | 14779 7390 13010 | 5840 3209 5678 | 9099 5762 8352 | 12336 6710 10944
G(17², r²) | 6392 1780 5884 | 14779 6988 13421 | 5840 3004 5878 | 9099 5397 8615 | 12336 5888 11264 | 27483 16620 24264
G(18², r²) | 14779 5238 13804 | 5840 2719 6082 | 9099 4512 8882 | 12336 5236 11588 | 27483 16130 24866 | 8100 4732 7774
G(19², r²) | 5840 1996 6290 | 9099 3439 9153 | 12336 4400 11916 | 27483 12873 25472 | 8100 4592 7989 | 17885 8395 13964
G(20², r²) | 9099 2330 9428 | 12336 3686 12248 | 27483 11938 26082 | 8100 4256 8140 | 17885 7682 14316 | 16804 9570 15532
G(21², r²) | 12336 2520 12584 | 27483 11039 26696 | 8100 3764 8365 | 17885 7025 14672 | 16804 9296 15912 | 21779 9767 16641
G(22², r²) | 27483 9454 27682 | 8100 3328 8515 | 17885 5996 15032 | 16804 8498 16296 | 21779 9044 17024 | 14968 8497 13994
G(23², r²) | 8100 2594 8722 | 17885 5023 15396 | 16804 7896 16684 | 21779 7766 17411 | 14968 8120 14328 | 35586 20372 32744
G(24², r²) | 17885 3326 15764 | 16804 7568 17076 | 21779 6558 17802 | 14968 7773 14666 | 35586 19676 33386 | 16380 9133 15712
G(25², r²) | 16804 3948 17472 | 21779 5576 18197 | 14968 7192 15008 | 35586 19067 34032 | 16380 8504 16084 | 39361 21392 36856
G(26², r²) | 21779 3974 18596 | 14968 5657 15354 | 35586 17190 34682 | 16380 8018 16410 | 39361 21064 37522 | 16828 9272 16400
G(27², r²) | 14968 3924 15704 | 35586 16337 35336 | 16380 6656 16733 | 39361 20197 38192 | 16828 8622 16717 | - - -
G(28², r²) | 35586 12958 36226 | 16380 5642 17092 | 39361 19518 38866 | 16828 7972 17010 | - - - | - - -
G(29², r²) | 16380 4640 17455 | 39361 16215 39544 | 16828 7350 17334 | - - - | - - - | - - -
G(30², r²) | 39361 12862 40316 | 16828 5904 17696 | - - - | - - - | - - - | - - -
G(31², r²) | 16828 4618 18062 | - - - | - - - | - - - | - - - | - - -

[Figure 5: one panel per layer (VGG 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2, 5.1/5.2 and AlexNet 2–5); each panel plots the speedup (0.50–1.50) of the Gauss–FFT vs. the Regular–FFT method.]

Figure 5: Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs. the Gauss–FFT method, VGG and AlexNet.


Table 8: AIs of Gauss–FFT transforms for a single tile or kernel

Method | r = 2 (In Ker Out) | r = 3 (In Ker Out) | r = 4 (In Ker Out) | r = 5 (In Ker Out) | r = 6 (In Ker Out) | r = 7 (In Ker Out)
G(2², r²) | 0.58 0.61 0.68 | 0.42 0.44 0.50 | 0.96 1.05 1.03 | 0.78 0.85 0.76 | 1.41 1.52 1.10 | 0.76 0.83 0.64
G(3², r²) | 0.42 0.38 0.54 | 0.96 1.02 1.09 | 0.78 0.75 0.83 | 1.41 1.52 1.46 | 0.76 0.81 0.76 | 1.59 1.70 1.46
G(4², r²) | 0.96 0.72 1.12 | 0.78 0.71 0.86 | 1.41 1.50 1.50 | 0.76 0.77 0.85 | 1.59 1.69 1.52 | 1.16 1.21 1.23
G(5², r²) | 0.78 0.66 0.89 | 1.41 1.14 1.53 | 0.76 0.74 0.91 | 1.59 1.51 1.58 | 1.16 1.15 1.26 | 2.22 2.39 2.31
G(6², r²) | 1.41 0.77 1.53 | 0.76 0.65 0.93 | 1.59 1.30 1.60 | 1.16 1.12 1.29 | 2.22 2.37 2.35 | 0.98 1.01 1.18
G(7², r²) | 0.76 0.55 0.93 | 1.59 1.13 1.60 | 1.16 1.09 1.31 | 2.22 1.98 2.37 | 0.98 1.03 1.19 | 2.18 2.27 2.32
G(8², r²) | 1.59 0.82 1.59 | 1.16 1.01 1.32 | 2.22 1.58 2.37 | 0.98 1.00 1.17 | 2.18 2.24 2.33 | 1.61 1.61 1.74
G(9², r²) | 1.16 0.73 1.32 | 2.22 1.18 2.36 | 0.98 0.90 1.16 | 2.18 2.22 2.40 | 1.61 1.58 1.75 | 1.55 1.65 1.67
G(10², r²) | 2.22 0.80 2.33 | 0.98 0.86 1.15 | 2.18 2.07 2.40 | 1.61 1.55 1.76 | 1.55 1.60 1.67 | 1.15 1.14 1.32
G(11², r²) | 0.98 0.64 1.14 | 2.18 1.77 2.38 | 1.61 1.35 1.76 | 1.55 1.50 1.74 | 1.15 1.09 1.33 | 2.70 2.82 2.99
G(12², r²) | 2.18 0.96 2.36 | 1.61 1.13 1.75 | 1.55 1.38 1.74 | 1.15 1.03 1.33 | 2.70 2.67 2.99 | 1.85 1.75 1.78
G(13², r²) | 1.61 0.76 1.75 | 1.55 1.20 1.73 | 1.15 0.98 1.34 | 2.70 2.54 2.97 | 1.85 1.71 1.79 | 3.97 3.78 3.97
G(14², r²) | 1.55 0.83 1.72 | 1.15 0.82 1.33 | 2.70 2.44 2.96 | 1.85 1.46 1.80 | 3.97 3.67 3.96 | 1.38 1.30 1.55
G(15², r²) | 1.15 0.66 1.33 | 2.70 2.03 2.93 | 1.85 1.17 1.79 | 3.97 3.37 3.93 | 1.38 1.21 1.55 | 2.01 2.19 2.10
G(16², r²) | 2.70 1.18 2.90 | 1.85 1.00 1.79 | 3.97 3.15 3.94 | 1.38 1.17 1.55 | 2.01 1.98 2.20 | 2.42 1.99 2.61
G(17², r²) | 1.85 0.82 1.77 | 3.97 3.02 3.91 | 1.38 1.11 1.55 | 2.01 1.88 2.19 | 2.42 1.78 2.60 | 5.06 4.74 5.43
G(18², r²) | 3.97 2.28 3.86 | 1.38 1.02 1.55 | 2.01 1.59 2.18 | 2.42 1.60 2.60 | 5.06 4.67 5.40 | 1.34 1.20 1.54
G(19², r²) | 1.38 0.75 1.54 | 2.01 1.22 2.17 | 2.42 1.36 2.58 | 5.06 3.77 5.36 | 1.34 1.18 1.54 | 2.79 2.05 2.61
G(20², r²) | 2.01 0.84 2.16 | 2.42 1.15 2.57 | 5.06 3.54 5.31 | 1.34 1.11 1.52 | 2.79 1.90 2.60 | 2.38 2.10 2.60
G(21², r²) | 2.42 0.79 2.55 | 5.06 3.30 5.26 | 1.34 0.99 1.52 | 2.79 1.76 2.59 | 2.38 2.06 2.59 | 2.92 2.06 2.64
G(22², r²) | 5.06 2.84 5.27 | 1.34 0.88 1.50 | 2.79 1.51 2.58 | 2.38 1.90 2.59 | 2.92 1.93 2.63 | 1.83 1.62 2.01
G(23², r²) | 1.34 0.69 1.49 | 2.79 1.28 2.56 | 2.38 1.78 2.57 | 2.92 1.68 2.62 | 1.83 1.57 2.00 | 4.15 3.76 4.46
G(24², r²) | 2.79 0.85 2.54 | 2.38 1.72 2.56 | 2.92 1.43 2.60 | 1.83 1.51 2.00 | 4.15 3.67 4.44 | 1.75 1.53 1.95
G(25², r²) | 2.38 0.90 2.54 | 2.92 1.22 2.59 | 1.83 1.41 1.99 | 4.15 3.58 4.41 | 1.75 1.44 1.95 | 4.02 3.48 4.36
G(26², r²) | 2.92 0.87 2.57 | 1.83 1.11 1.98 | 4.15 3.25 4.38 | 1.75 1.37 1.94 | 4.02 3.46 4.33 | 1.58 1.38 1.78
G(27², r²) | 1.83 0.78 1.97 | 4.15 3.11 4.34 | 1.75 1.14 1.93 | 4.02 3.34 4.31 | 1.58 1.29 1.77 | - - -
G(28², r²) | 4.15 2.47 4.34 | 1.75 0.97 1.92 | 4.02 3.24 4.28 | 1.58 1.20 1.76 | - - - | - - -
G(29², r²) | 1.75 0.80 1.91 | 4.02 2.71 4.24 | 1.58 1.11 1.75 | - - - | - - - | - - -
G(30², r²) | 4.02 2.16 4.22 | 1.58 0.90 1.75 | - - - | - - - | - - - | - - -
G(31², r²) | 1.58 0.71 1.74 | - - - | - - - | - - - | - - - | - - -


[Figure 6: per–layer bar charts (VGG 1.2 through 5.1/5.2 and AlexNet 2–5) of running time in milliseconds for Our Winograd, Our Regular–FFT, Our Gauss–FFT, MKL-DNN Winograd, MKL-DNN Direct and LIBXSMM Winograd, on a Xeon Phi 7210 and an i9-7900X at several memory bandwidths and compute–to–memory ratios (CMR); bars marked "na" denote unsupported configurations.]

Figure 6: Running time of our implementations compared to state–of–the–art on AVX512–capable systems.


[Figure 7: same layout as Figure 6, for additional configurations of the Xeon Phi 7210 (including a 48–core configuration) and the i9-7900X at higher compute–to–memory ratios.]

Figure 7: Running time of our implementations compared to state–of–the–art on AVX512–capable systems, continued.

17

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 Winogradndash and FFTndash Based Convolutions
    • 22 Winogradndash and FFTndash Based Convolution Layers
    • 23 Gauss Multiplication of Complex Numbers
      • 3 Implementations
      • 4 Performance Comparisons
      • 5 Performance Analysis
        • 51 Relative Performances
        • 52 The Accuracy of the Theoretical Estimates
        • 53 Discussions
          • 6 Conclusions
          • References
          • A Detailed Analysis
            • A1 Input Transform Stage
            • A2 Kernel Transform Stage
            • A3 Elementndashwise Products
            • A4 Output Transform
              • B Model Lookup Tables
              • C Additional Model Verification
              • D Comparing to Statendashofndashthendashart
Page 12: FFT Convolutions are Faster than Winograd on Modern CPUs ... · Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One

Table 3 FLOPS required for performing Winograd transforms of a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 12 5 5 32 49 42 180 198 154 240 231 168 742 728 504 704 645 430F(32 r2) 32 24 28 180 128 128 240 170 153 742 552 460 704 518 407 - - -F(42 r2) 180 70 90 240 135 150 742 396 396 704 403 372 - - - - - -F(52 r2) 240 72 99 742 260 312 704 300 325 - - - - - - - - -F(62 r2) 742 144 208 704 242 308 - - - - - - - - - - - -F(72 r2) 704 130 195 - - - - - - - - - - - - - - -

Table 4 AIs of Winograd transforms for a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 017 010 010 025 049 053 090 121 133 083 095 105 189 214 238 138 143 158F(32 r2) 025 030 028 090 094 094 083 082 085 189 186 198 138 129 139 - - -F(42 r2) 090 060 055 083 075 072 189 152 152 138 113 116 - - - - - -F(52 r2) 083 045 041 189 112 105 138 094 091 - - - - - - - - -F(62 r2) 189 068 061 138 083 077 - - - - - - - - - - - -F(72 r2) 138 048 043 - - - - - - - - - - - - - - -

Table 5 FLOPS required for performing RegularndashFFT transforms of a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 54 36 42 72 48 48 245 206 152 300 256 160 702 632 288 492 434 191F(32 r2) 72 28 66 245 171 186 300 192 197 702 566 444 492 380 263 1292 1088 678F(42 r2) 245 92 224 300 158 232 702 504 504 492 320 335 1292 992 758 1200 904 768F(52 r2) 300 128 274 702 325 568 492 276 400 1292 805 850 1200 792 832 2710 2118 1820F(62 r2) 702 172 636 492 206 453 1292 624 930 1200 722 912 2710 1980 1956 1416 924 1074F(72 r2) 492 146 503 1292 491 1014 1200 656 996 2710 1523 2096 1416 896 1141 3681 2583 2646F(82 r2) 1292 296 1102 1200 564 1084 2710 1108 2240 1416 823 1190 3681 2432 2808 3228 2082 2384F(92 r2) 1200 340 1176 2710 735 2388 1416 672 1262 3681 2304 3066 3228 1960 2526 3405 2254 2499F(102 r2) 2710 404 2540 1416 610 1336 3681 2052 3240 3228 1842 2672 3405 2082 2616 2904 1688 2292F(112 r2) 1416 364 1412 3681 1663 3418 3228 1504 2822 3405 1865 2897 2904 1530 2428 7793 5153 6354F(122 r2) 3681 724 3600 3228 1162 2976 3405 1628 3052 2904 1374 2562 7793 4716 6630 6068 3479 4228F(132 r2) 3228 648 3134 3405 1324 3211 2904 1246 2707 7793 4341 6892 6068 3290 4434 14418 8629 11023F(142 r2) 3405 764 3374 2904 936 2824 7793 4058 7167 6068 2660 4648 14418 8176 11410 5440 2876 4512F(152 r2) 2904 634 2976 7793 3231 7446 6068 1944 4840 14418 7305 11775 5440 2580 4682 8658 5617 6841F(162 r2) 7793 1616 7720 6068 1540 5036 14418 6668 12288 5440 2409 4878 8658 4880 7470 11852 5742 9976F(172 r2) 6068 1132 5236 14418 6266 12699 5440 2204 5078 8658 4515 7733 11852 4920 10296 26954 15562 23206F(182 r2) 14418 4516 13082 5440 1919 5282 8658 3630 8000 11852 4268 10620 26954 15072 23808 7524 3580 6622F(192 r2) 5440 1196 5490 8658 2557 8271 11852 3432 10948 26954 11815 24414 7524 3440 6837 17260 7145 12714F(202 r2) 8658 1448 8546 11852 2718 11280 26954 10880 25024 7524 3104 6988 17260 6432 13066 16128 8218 14180F(212 r2) 11852 1552 11616 26954 9981 25638 7524 2612 7213 17260 5775 13422 16128 7944 14560 21050 8309 15183F(222 r2) 26954 8396 26624 7524 2176 7363 17260 4746 13782 16128 7146 14944 21050 7586 15566 14184 6929 12426F(232 r2) 7524 1442 7570 17260 3773 14146 16128 6544 15332 21050 6308 15953 14184 6552 12760 34745 18690 31062F(242 r2) 17260 2076 14514 16128 6216 15724 21050 5100 16344 14184 6205 13098 34745 17994 31704 15480 7333 13912F(252 r2) 16128 2596 16120 21050 4118 16739 14184 5624 13440 34745 17385 32350 15480 6704 14284 38400 19470 34934F(262 r2) 21050 2516 17138 14184 4089 13786 34745 15508 33000 15480 6218 14610 38400 19142 35600 15804 7224 14352F(272 r2) 14184 2356 14136 34745 14655 33654 15480 4856 14933 38400 18275 36270 15804 6574 14669 - - -F(282 r2) 34745 11276 34544 15480 3842 15292 38400 17596 36944 15804 5924 14962 - - - - - -F(292 r2) 15480 2840 15655 38400 14293 37622 15804 5302 15286 - - - - - - - - -F(302 r2) 38400 10940 38394 15804 3856 15648 - - - - - - - - - - - -F(312 r2) 15804 2570 16014 - - - - - - - - - - - - - - -

12

Table 6 AIs of RegularndashFFT transforms for a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

and the arithmetic intensities for transforming a single in-putoutput tile or a single kernel for the GaussndashFFT method

C Additional Model VerificationIn Fig 5 we show the theoretical predictions and the empiri-cal results of the GaussndashFFT vs the RegularndashFFT methods

D Comparing to StatendashofndashthendashartIn Fig 6 and Fig 7 we show the results of the absolute timerequired for processing unique layers of VGG and AlexNetof our implementation and other statendashofndashthendashart librariesThe benchmarks were performed on AVX512 capable sys-tems as the neither MKL-DNN nor LIBXSMM support AVX2Both MKL-DNN and LIBXSMM support only kernels of

size 3 times 3 for their Winograd implementation thus the mea-surements for the layer 2 of AlexNet which have kerenls ofsize 5 times 5 are missing

13

Table 7 FLOPS required for performing GaussndashFFT transforms of a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

050

075

100

125

150

0 20 40 60

VGG 12

050

075

100

125

150

0 20 40 60

VGG 21

050

075

100

125

150

0 20 40 60

VGG 22

050

075

100

125

150

0 20 40 60

VGG 31

050

075

100

125

150

0 20 40 60

VGG 32

050

075

100

125

150

0 20 40 60

VGG 41

050

075

100

125

150

0 20 40 60

VGG 42

050

075

100

125

150

0 20 40 60

VGG 5152

050

075

100

125

150

0 20 40 60

AlexNet 2

050

075

100

125

150

0 20 40 60

AlexNet 3

050

075

100

125

150

0 20 40 60

AlexNet 4

050

075

100

125

150

0 20 40 60

AlexNet 5

GaussminusFFT vs RegularminusFFT

Spe

edup

Figure 5 Theoretical estimates of our model and empirical measurements for the speedup of the Regularndash vs the GaussndashFFT method VGGand AlexNet

14

Table 8 AIs of GaussndashFFT transforms for a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

15

139

44

698

5

617

2

353

7

363

1

347

4

0

50

100

150VGG 12

703

0

335

0

251

6

166

3

164

3

164

8

0

20

40

60

VGG 21

135

67

680

4364

3

264

2

271

2

272

9

0

50

100

VGG 22

736

3

323

8

156

0

126

4

137

9

121

6

0

20

40

60

80VGG 31

111

98

658

8

335

9

227

3

255

1

215

8

0

30

60

90

120VGG 32

760

6

313

0

155

3

123

0

131

2

113

1

0

20

40

60

80

VGG 41

128

10

626

9

284

9

231

9

257

7

215

4

0

50

100

VGG 42

669

9

184

9

124

2

824

767

748

0

20

40

60

VGG 5152

na

145

8

na

386

384

101

9

0

5

10

15

AlexNet 2

217

4

954

687

364

351

427

0

5

10

15

20

AlexNet 3

369

9

129

5

102

6

493

513

557

0

10

20

30

40AlexNet 4

315

2

874

853

327

353

370

0

10

20

30

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 4096 GBs CMR of 11T

ime

(mill

isec

onds

)

239

57

291

76

221

71145

64

116

65

173

13

0

100

200

300

VGG 12

734

4970

1

810

2

614

1

512

3

679

6

0

25

50

75

100

VGG 21

140

67

232

48

144

48843

2

817

5

949

0

0

50

100

150

200

250VGG 22

473

4

109

89

767

5

390

9

390

8

401

7

0

30

60

90

120VGG 31

824

0

238

56

143

91644

1

684

6

652

4

0

50

100

150

200

250

VGG 32

389

6

136

04

672

4

362

4

378

2

305

8

0

50

100

VGG 41

410

9

163

66

883

3662

2

715

0

583

5

0

50

100

150

VGG 42

295

4

693

2

405

9

206

3

221

9

205

0

0

20

40

60

VGG 5152n

a

169

02

na14

42

110

7

413

7

0

50

100

150

AlexNet 2

159

1756

3

321

8

107

5

103

5

127

6

0

20

40

60

80

AlexNet 3

198

3627

8

437

7

144

1

140

5

159

4

0

20

40

60

AlexNet 4

151

0

422

2

456

4

102

6

96711

36

0

10

20

30

40

50AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 96 GBs CMR of 22

Tim

e (m

illis

econ

ds)

na

105

95n

a30

12

991

796

0

30

60

90

AlexNet 2

106

1369

827

33

916

788

733

0

10

20

30

40

AlexNet 3

135

8392

830

15

114

210

05

995

0

10

20

30

40

AlexNet 49

3525

36

206

38

097

026

72

0

10

20

AlexNet 5

160

0113

521

128

3012

212

106

7382

31

0

50

100

150

VGG 12

596

857

36

524

048

08

440

336

37

0

20

40

60

VGG 21

840

611

492

834

968

35

616

157

21

0

25

50

75

100

125VGG 22

323

954

70

319

728

64

283

327

97

0

20

40

60VGG 31

570

010

808

733

747

13

462

248

77

0

30

60

90

VGG 32

262

554

76

283

8222

225

85

268

6

0

20

40

60VGG 41

507

610

635

574

8417

147

60

503

1

0

30

60

90

VGG 42

197

9295

917

6714

58

148

615

59

0

10

20

30

VGG 5152

RegularminusFFT GaussminusFFT Winograd (Jia et al) MKLminusDNN Win MKLminusDNN Direct LIBXSMM Win

Tim

e (m

illis

econ

ds)

314

72

315

87

281

13

189

55

142

48

225

97

0

100

200

300

VGG 12

903

8

110

24

103

80

799

0

609

3885

9

0

30

60

90

120VGG 21

160

52209

11

152

36103

31

927

6

116

63

0

50

100

150

200

VGG 22

539

4

135

21

777

9468

0

437

8

469

5

0

50

100

VGG 31

931

0

240

95

180

10

735

7

749

2

727

9

0

50

100

150

200

250

VGG 3243

56

139

03

711

1414

3

413

2

340

0

0

50

100

150VGG 41

809

6

207

68

133

21736

4

775

4

635

6

0

50

100

150

200

VGG 42

316

3

714

6

284

3

230

9

440

7

223

0

0

20

40

60

VGG 5152

na

153

37

na19

59

137

8533

5

0

50

100

150

AlexNet 2

168

8710

7

382

0

130

1

116

3

142

7

0

20

40

60

AlexNet 3

227

9102

89

628

9

175

8

159

0

177

6

0

30

60

90

AlexNet 4

181

1653

0

370

8

125

1

109

5

126

2

0

20

40

60

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 683 GBs CMR of 31

Tim

e (m

illis

econ

ds)

Figure 6 Running time of our implementations compared to statendashofndashthendashart on AVX512 capable systems

16

283

83

155

28216

12124

36

906

9

146

12

0

100

200

300

VGG 12

102

92

694

5

829

5

519

2383

1

568

7

0

30

60

90

VGG 21

178

06

138

54

104

85683

1

528

2

745

0

0

50

100

150

VGG 22

102

36

442

4

372

2

293

4

249

6

291

5

0

30

60

90

VGG 31

121

72

875

5

675

7446

6

445

5

419

1

0

50

100

VGG 32

933

3

420

0

272

2

248

6

241

8

193

8

0

25

50

75

100VGG 41

145

85

833

4509

5

452

8

444

4

354

0

0

50

100

150

VGG 42

647

3

232

8

185

5

149

1

140

7

124

4

0

20

40

60

VGG 5152

na19

95

na

122

4

852

359

8

0

10

20

30

AlexNet 2

226

6

116

1

130

2873

717815

0

5

10

15

20

AlexNet 3

395

5

162

2

193

6

116

3

106

5

106

7

0

10

20

30

40

AlexNet 4

246

2

108

7

129

0779

725

729

0

10

20

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 (using 48 cores) 1024 GBs CMR of 33T

ime

(mill

isec

onds

)

297

93

146

66

206

10126

13

944

4

146

57

0

100

200

300

VGG 12

116

19

638

2

718

8526

0

393

8

575

9

0

25

50

75

100

125VGG 21

215

60

129

62

107

95711

7

524

6

770

6

0

50

100

150

200

VGG 22

105

16

559

3391

3

308

3

240

7

294

5

0

30

60

90

VGG 31

106

90

117

46

976

6

438

2

425

6

398

1

0

40

80

120

VGG 32

104

13

334

4

435

3

226

1

209

6

174

0

0

30

60

90

VGG 41

142

18

667

2

740

1415

0

388

2

308

6

0

50

100

150

VGG 42

798

9

186

1

167

3

139

2

124

0

109

3

0

20

40

60

80

VGG 5152n

a

156

1

na

122

7

865

363

6

0

10

20

30

40AlexNet 2

287

0

95214

12

852

661800

0

10

20

30

AlexNet 3

333

1

130

4

229

1

106

1

957

927

0

10

20

30

AlexNet 4

299

9

87612

33

756

670

689

0

10

20

30

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

Xeon Phi 7210 1024 GBs CMR of 3911

Tim

e (m

illis

econ

ds)

403

59

361

04

291

20

254

04182

45

292

40

0

100

200

300

400

VGG 12

114

93

136

57

121

77

103

81

766

8

113

60

0

50

100

150VGG 21

187

07281

99

238

27

132

57

106

41

148

55

0

100

200

300VGG 22

638

0

142

72

949

7

592

7

493

5

590

1

0

50

100

150

VGG 31

105

34

263

79

188

49

856

3

830

4

820

4

0

100

200

VGG 32

491

5

134

24

896

6

484

0

458

3

379

9

0

50

100

VGG 41

887

1

248

79

170

67865

1

849

7

695

1

0

100

200

VGG 42

343

9

713

3

444

4

261

4

263

0

250

7

0

20

40

60

VGG 5152

na

186

41

na24

94

176

1685

8

0

50

100

150

200AlexNet 2

188

0700

4

640

6

160

5

133

8

163

3

0

20

40

60

AlexNet 326

32

102

98

658

6

207

5

183

8

201

5

0

30

60

90

AlexNet 4

200

6691

4

438

8

148

7

124

2

144

3

0

20

40

60

AlexNet 5

Our Winograd Our RegularminusFFT Our GaussminusFFT MKLminusDNN Winograd MKLminusDNN Direct LIBXSMM Winograd

i9minus7900X 512 GBs CMR of 4125

Tim

e (m

illis

econ

ds)

Figure 7 Running time of our implementations compared to statendashofndashthendashart on AVX512 capable systems continued

17

  • Abstract
  • 1 Introduction
  • 2 Background
    • 21 Winogradndash and FFTndash Based Convolutions
    • 22 Winogradndash and FFTndash Based Convolution Layers
    • 23 Gauss Multiplication of Complex Numbers
      • 3 Implementations
      • 4 Performance Comparisons
      • 5 Performance Analysis
        • 51 Relative Performances
        • 52 The Accuracy of the Theoretical Estimates
        • 53 Discussions
          • 6 Conclusions
          • References
          • A Detailed Analysis
            • A1 Input Transform Stage
            • A2 Kernel Transform Stage
            • A3 Elementndashwise Products
            • A4 Output Transform
              • B Model Lookup Tables
              • C Additional Model Verification
              • D Comparing to Statendashofndashthendashart
Page 13: FFT Convolutions are Faster than Winograd on Modern CPUs ... · Using the same optimization techniques, we additionally implemented two FFT–based algorithms for modern CPUs. One

Table 6 AIs of RegularndashFFT transforms for a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

F(22 r2) 064 056 066 045 036 043 111 112 112 089 088 077 167 172 120 085 084 057F(32 r2) 045 025 050 111 110 119 089 075 086 167 175 171 085 082 074 189 196 171F(42 r2) 111 068 122 089 069 091 167 175 175 085 076 087 189 197 179 136 134 141F(52 r2) 089 062 094 167 125 175 085 072 095 189 175 185 136 127 143 268 293 290F(62 r2) 167 072 173 085 058 098 189 147 185 136 124 146 268 295 291 113 106 132F(72 r2) 085 043 097 189 124 182 136 121 147 268 243 290 113 110 131 262 280 286F(82 r2) 189 079 179 136 109 147 268 187 286 113 107 128 262 279 285 192 191 207F(92 r2) 136 069 146 268 130 280 113 091 127 262 278 291 192 188 207 183 195 195F(102 r2) 268 074 274 113 086 125 262 259 287 192 185 206 183 189 192 133 125 148F(112 r2) 113 053 122 262 218 282 192 157 204 183 176 201 133 118 148 327 363 372F(122 r2) 262 097 276 192 125 202 183 159 199 133 110 148 327 345 368 222 213 210F(132 r2) 192 071 199 183 133 196 133 102 148 327 328 363 222 208 210 486 503 502F(142 r2) 183 078 193 133 079 146 327 315 357 222 173 209 486 491 495 162 147 177F(152 r2) 133 054 145 327 256 351 222 129 207 486 451 487 162 136 176 240 275 249F(162 r2) 327 130 343 222 104 204 486 421 483 162 130 175 240 245 260 293 249 318F(172 r2) 222 078 202 486 403 475 162 121 174 240 232 257 293 218 315 623 647 690F(182 r2) 486 294 465 162 107 173 240 190 254 293 193 312 623 641 679 157 133 175F(192 r2) 162 067 171 240 136 251 293 158 308 623 512 669 157 130 174 338 256 314F(202 r2) 240 078 248 293 127 304 623 479 657 157 120 171 338 234 311 287 264 314F(212 r2) 293 073 300 623 445 645 157 102 169 338 214 308 287 260 311 354 258 317F(222 r2) 623 378 642 157 086 166 338 178 304 287 237 308 354 239 314 218 195 235F(232 r2) 157 057 164 338 143 300 287 220 305 354 202 310 218 187 233 508 508 555F(242 r2) 338 079 296 287 211 301 354 165 307 218 179 231 508 497 548 208 182 226F(252 r2) 287 089 298 354 135 303 218 164 229 508 486 541 208 168 225 492 468 540F(262 r2) 354 083 299 218 120 227 508 438 534 208 158 223 492 466 534 187 159 203F(272 r2) 218 070 225 508 417 526 208 124 221 492 449 527 187 146 202 - - -F(282 r2) 508 323 522 208 099 219 492 436 520 187 133 200 - - - - - -F(292 r2) 208 074 217 492 357 513 187 120 198 - - - - - - - - -F(302 r2) 492 275 507 187 088 197 - - - - - - - - - - - -F(312 r2) 187 059 195 - - - - - - - - - - - - - - -

and the arithmetic intensities for transforming a single in-putoutput tile or a single kernel for the GaussndashFFT method

C Additional Model VerificationIn Fig 5 we show the theoretical predictions and the empiri-cal results of the GaussndashFFT vs the RegularndashFFT methods

D Comparing to StatendashofndashthendashartIn Fig 6 and Fig 7 we show the results of the absolute timerequired for processing unique layers of VGG and AlexNetof our implementation and other statendashofndashthendashart librariesThe benchmarks were performed on AVX512 capable sys-tems as the neither MKL-DNN nor LIBXSMM support AVX2Both MKL-DNN and LIBXSMM support only kernels of

size 3 times 3 for their Winograd implementation thus the mea-surements for the layer 2 of AlexNet which have kerenls ofsize 5 times 5 are missing

13

Table 7 FLOPS required for performing GaussndashFFT transforms of a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

G(22 r2) 63 54 60 88 80 80 270 256 202 336 328 232 751 730 386 556 562 319G(32 r2) 88 60 98 270 221 236 336 264 269 751 664 542 556 508 391 1373 1250 840G(42 r2) 270 142 274 336 230 304 751 602 602 556 448 463 1373 1154 920 1300 1104 968G(52 r2) 336 200 346 751 423 666 556 404 528 1373 967 1012 1300 992 1032 2831 2360 2062G(62 r2) 751 270 734 556 334 581 1373 786 1092 1300 922 1112 2831 2222 2198 1560 1212 1362G(72 r2) 556 274 631 1373 653 1176 1300 856 1196 2831 1765 2338 1560 1184 1429 3850 2921 2984G(82 r2) 1373 458 1264 1300 764 1284 2831 1350 2482 1560 1111 1478 3850 2770 3146 3424 2474 2776G(92 r2) 1300 540 1376 2831 977 2630 1560 960 1550 3850 2642 3404 3424 2352 2918 3630 2704 2949G(102 r2) 2831 646 2782 1560 898 1624 3850 2390 3578 3424 2234 3064 3630 2532 3066 3160 2200 2804G(112 r2) 1560 652 1700 3850 2001 3756 3424 1896 3214 3630 2315 3347 3160 2042 2940 8082 5731 6932G(122 r2) 3850 1062 3938 3424 1554 3368 3630 2078 3502 3160 1886 3074 8082 5294 7208 6392 4127 4876G(132 r2) 3424 1040 3526 3630 1774 3661 3160 1758 3219 8082 4919 7470 6392 3938 5082 14779 9351 11745G(142 r2) 3630 1214 3824 3160 1448 3336 8082 4636 7745 6392 3308 5296 14779 8898 12132 5840 3676 5312G(152 r2) 3160 1146 3488 8082 3809 8024 6392 2592 5488 14779 8027 12497 5840 3380 5482 9099 6499 7723G(162 r2) 8082 2194 8298 6392 2188 5684 14779 7390 13010 5840 3209 5678 9099 5762 8352 12336 6710 10944G(172 r2) 6392 1780 5884 14779 6988 13421 5840 3004 5878 9099 5397 8615 12336 5888 11264 27483 16620 24264G(182 r2) 14779 5238 13804 5840 2719 6082 9099 4512 8882 12336 5236 11588 27483 16130 24866 8100 4732 7774G(192 r2) 5840 1996 6290 9099 3439 9153 12336 4400 11916 27483 12873 25472 8100 4592 7989 17885 8395 13964G(202 r2) 9099 2330 9428 12336 3686 12248 27483 11938 26082 8100 4256 8140 17885 7682 14316 16804 9570 15532G(212 r2) 12336 2520 12584 27483 11039 26696 8100 3764 8365 17885 7025 14672 16804 9296 15912 21779 9767 16641G(222 r2) 27483 9454 27682 8100 3328 8515 17885 5996 15032 16804 8498 16296 21779 9044 17024 14968 8497 13994G(232 r2) 8100 2594 8722 17885 5023 15396 16804 7896 16684 21779 7766 17411 14968 8120 14328 35586 20372 32744G(242 r2) 17885 3326 15764 16804 7568 17076 21779 6558 17802 14968 7773 14666 35586 19676 33386 16380 9133 15712G(252 r2) 16804 3948 17472 21779 5576 18197 14968 7192 15008 35586 19067 34032 16380 8504 16084 39361 21392 36856G(262 r2) 21779 3974 18596 14968 5657 15354 35586 17190 34682 16380 8018 16410 39361 21064 37522 16828 9272 16400G(272 r2) 14968 3924 15704 35586 16337 35336 16380 6656 16733 39361 20197 38192 16828 8622 16717 - - -G(282 r2) 35586 12958 36226 16380 5642 17092 39361 19518 38866 16828 7972 17010 - - - - - -G(292 r2) 16380 4640 17455 39361 16215 39544 16828 7350 17334 - - - - - - - - -G(302 r2) 39361 12862 40316 16828 5904 17696 - - - - - - - - - - - -G(312 r2) 16828 4618 18062 - - - - - - - - - - - - - - -

050

075

100

125

150

0 20 40 60

VGG 12

050

075

100

125

150

0 20 40 60

VGG 21

050

075

100

125

150

0 20 40 60

VGG 22

050

075

100

125

150

0 20 40 60

VGG 31

050

075

100

125

150

0 20 40 60

VGG 32

050

075

100

125

150

0 20 40 60

VGG 41

050

075

100

125

150

0 20 40 60

VGG 42

050

075

100

125

150

0 20 40 60

VGG 5152

050

075

100

125

150

0 20 40 60

AlexNet 2

050

075

100

125

150

0 20 40 60

AlexNet 3

050

075

100

125

150

0 20 40 60

AlexNet 4

050

075

100

125

150

0 20 40 60

AlexNet 5

GaussminusFFT vs RegularminusFFT

Spe

edup

Figure 5 Theoretical estimates of our model and empirical measurements for the speedup of the Regularndash vs the GaussndashFFT method VGGand AlexNet

14

Table 8 AIs of GaussndashFFT transforms for a single tile or kernel

r = 2 r = 3 r = 4 r = 5 r = 6 r = 7Method In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out In Ker Out

G(22 r2) 058 061 068 042 044 050 096 105 103 078 085 076 141 152 110 076 083 064G(32 r2) 042 038 054 096 102 109 078 075 083 141 152 146 076 081 076 159 170 146G(42 r2) 096 072 112 078 071 086 141 150 150 076 077 085 159 169 152 116 121 123G(52 r2) 078 066 089 141 114 153 076 074 091 159 151 158 116 115 126 222 239 231G(62 r2) 141 077 153 076 065 093 159 130 160 116 112 129 222 237 235 098 101 118G(72 r2) 076 055 093 159 113 160 116 109 131 222 198 237 098 103 119 218 227 232G(82 r2) 159 082 159 116 101 132 222 158 237 098 100 117 218 224 233 161 161 174G(92 r2) 116 073 132 222 118 236 098 090 116 218 222 240 161 158 175 155 165 167G(102 r2) 222 080 233 098 086 115 218 207 240 161 155 176 155 160 167 115 114 132G(112 r2) 098 064 114 218 177 238 161 135 176 155 150 174 115 109 133 270 282 299G(122 r2) 218 096 236 161 113 175 155 138 174 115 103 133 270 267 299 185 175 178G(132 r2) 161 076 175 155 120 173 115 098 134 270 254 297 185 171 179 397 378 397G(142 r2) 155 083 172 115 082 133 270 244 296 185 146 180 397 367 396 138 130 155G(152 r2) 115 066 133 270 203 293 185 117 179 397 337 393 138 121 155 201 219 210G(162 r2) 270 118 290 185 100 179 397 315 394 138 117 155 201 198 220 242 199 261G(172 r2) 185 082 177 397 302 391 138 111 155 201 188 219 242 178 260 506 474 543G(182 r2) 397 228 386 138 102 155 201 159 218 242 160 260 506 467 540 134 120 154G(192 r2) 138 075 154 201 122 217 242 136 258 506 377 536 134 118 154 279 205 261G(202 r2) 201 084 216 242 115 257 506 354 531 134 111 152 279 190 260 238 210 260G(212 r2) 242 079 255 506 330 526 134 099 152 279 176 259 238 206 259 292 206 264G(222 r2) 506 284 527 134 088 150 279 151 258 238 190 259 292 193 263 183 162 201G(232 r2) 134 069 149 279 128 256 238 178 257 292 168 262 183 157 200 415 376 446G(242 r2) 279 085 254 238 172 256 292 143 260 183 151 200 415 367 444 175 153 195G(252 r2) 238 090 254 292 122 259 183 141 199 415 358 441 175 144 195 402 348 436G(262 r2) 292 087 257 183 111 198 415 325 438 175 137 194 402 346 433 158 138 178G(272 r2) 183 078 197 415 311 434 175 114 193 402 334 431 158 129 177 - - -G(282 r2) 415 247 434 175 097 192 402 324 428 158 120 176 - - - - - -G(292 r2) 175 080 191 402 271 424 158 111 175 - - - - - - - - -G(302 r2) 402 216 422 158 090 175 - - - - - - - - - - - -G(312 r2) 158 071 174 - - - - - - - - - - - - - - -

15

139

44

698

5

617

2

353

7

363

1

347

4

0

50

100

150VGG 12

703

0

335

0

251

6

166

3

164

3

164

8

0

20

40

60

VGG 21

135

67

680

4364

3

264

2

271

2

272

9

0

50

100

VGG 22

736

3

323

8

156

0

126

4

137

9

121

6

0

20

40

60

[Figure 6 plot data: grouped bar charts of per-layer running times (milliseconds) for VGG layers 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2, 5.1/5.2 and AlexNet layers 2–5, comparing our Winograd, our Regular–FFT, our Gauss–FFT, MKL-DNN Winograd, MKL-DNN Direct, and LIBXSMM Winograd on three configurations: Xeon Phi 7210 at 409.6 GB/s (CMR of 11), i9-7900X at 96 GB/s (CMR of 22), and i9-7900X at 68.3 GB/s (CMR of 31). Only the panel titles, legend, and axis label (Time, in milliseconds) are recoverable from the extracted text.]

Figure 6. Running time of our implementations compared to state–of–the–art on AVX512 capable systems.


[Figure 7 plot data: the same per-layer running-time comparison, continued for three further configurations: Xeon Phi 7210 using 48 cores at 102.4 GB/s (CMR of 33), Xeon Phi 7210 at 102.4 GB/s (CMR of 39.11), and i9-7900X at 51.2 GB/s (CMR of 41.25). Only the panel titles, legend, and axis label (Time, in milliseconds) are recoverable from the extracted text.]

Figure 7. Running time of our implementations compared to state–of–the–art on AVX512 capable systems (continued).


Table 7. FLOPS required for performing Gauss–FFT transforms of a single tile or kernel (In = input transform, Ker = kernel transform, Out = output transform), for kernel sizes r = 2 through r = 7.

Method       r = 2: In Ker Out | r = 3: In Ker Out | r = 4: In Ker Out | r = 5: In Ker Out | r = 6: In Ker Out | r = 7: In Ker Out
G(2², r²)    63 54 60 | 88 80 80 | 270 256 202 | 336 328 232 | 751 730 386 | 556 562 319
G(3², r²)    88 60 98 | 270 221 236 | 336 264 269 | 751 664 542 | 556 508 391 | 1373 1250 840
G(4², r²)    270 142 274 | 336 230 304 | 751 602 602 | 556 448 463 | 1373 1154 920 | 1300 1104 968
G(5², r²)    336 200 346 | 751 423 666 | 556 404 528 | 1373 967 1012 | 1300 992 1032 | 2831 2360 2062
G(6², r²)    751 270 734 | 556 334 581 | 1373 786 1092 | 1300 922 1112 | 2831 2222 2198 | 1560 1212 1362
G(7², r²)    556 274 631 | 1373 653 1176 | 1300 856 1196 | 2831 1765 2338 | 1560 1184 1429 | 3850 2921 2984
G(8², r²)    1373 458 1264 | 1300 764 1284 | 2831 1350 2482 | 1560 1111 1478 | 3850 2770 3146 | 3424 2474 2776
G(9², r²)    1300 540 1376 | 2831 977 2630 | 1560 960 1550 | 3850 2642 3404 | 3424 2352 2918 | 3630 2704 2949
G(10², r²)   2831 646 2782 | 1560 898 1624 | 3850 2390 3578 | 3424 2234 3064 | 3630 2532 3066 | 3160 2200 2804
G(11², r²)   1560 652 1700 | 3850 2001 3756 | 3424 1896 3214 | 3630 2315 3347 | 3160 2042 2940 | 8082 5731 6932
G(12², r²)   3850 1062 3938 | 3424 1554 3368 | 3630 2078 3502 | 3160 1886 3074 | 8082 5294 7208 | 6392 4127 4876
G(13², r²)   3424 1040 3526 | 3630 1774 3661 | 3160 1758 3219 | 8082 4919 7470 | 6392 3938 5082 | 14779 9351 11745
G(14², r²)   3630 1214 3824 | 3160 1448 3336 | 8082 4636 7745 | 6392 3308 5296 | 14779 8898 12132 | 5840 3676 5312
G(15², r²)   3160 1146 3488 | 8082 3809 8024 | 6392 2592 5488 | 14779 8027 12497 | 5840 3380 5482 | 9099 6499 7723
G(16², r²)   8082 2194 8298 | 6392 2188 5684 | 14779 7390 13010 | 5840 3209 5678 | 9099 5762 8352 | 12336 6710 10944
G(17², r²)   6392 1780 5884 | 14779 6988 13421 | 5840 3004 5878 | 9099 5397 8615 | 12336 5888 11264 | 27483 16620 24264
G(18², r²)   14779 5238 13804 | 5840 2719 6082 | 9099 4512 8882 | 12336 5236 11588 | 27483 16130 24866 | 8100 4732 7774
G(19², r²)   5840 1996 6290 | 9099 3439 9153 | 12336 4400 11916 | 27483 12873 25472 | 8100 4592 7989 | 17885 8395 13964
G(20², r²)   9099 2330 9428 | 12336 3686 12248 | 27483 11938 26082 | 8100 4256 8140 | 17885 7682 14316 | 16804 9570 15532
G(21², r²)   12336 2520 12584 | 27483 11039 26696 | 8100 3764 8365 | 17885 7025 14672 | 16804 9296 15912 | 21779 9767 16641
G(22², r²)   27483 9454 27682 | 8100 3328 8515 | 17885 5996 15032 | 16804 8498 16296 | 21779 9044 17024 | 14968 8497 13994
G(23², r²)   8100 2594 8722 | 17885 5023 15396 | 16804 7896 16684 | 21779 7766 17411 | 14968 8120 14328 | 35586 20372 32744
G(24², r²)   17885 3326 15764 | 16804 7568 17076 | 21779 6558 17802 | 14968 7773 14666 | 35586 19676 33386 | 16380 9133 15712
G(25², r²)   16804 3948 17472 | 21779 5576 18197 | 14968 7192 15008 | 35586 19067 34032 | 16380 8504 16084 | 39361 21392 36856
G(26², r²)   21779 3974 18596 | 14968 5657 15354 | 35586 17190 34682 | 16380 8018 16410 | 39361 21064 37522 | 16828 9272 16400
G(27², r²)   14968 3924 15704 | 35586 16337 35336 | 16380 6656 16733 | 39361 20197 38192 | 16828 8622 16717 | - - -
G(28², r²)   35586 12958 36226 | 16380 5642 17092 | 39361 19518 38866 | 16828 7972 17010 | - - - | - - -
G(29², r²)   16380 4640 17455 | 39361 16215 39544 | 16828 7350 17334 | - - - | - - - | - - -
G(30², r²)   39361 12862 40316 | 16828 5904 17696 | - - - | - - - | - - - | - - -
G(31², r²)   16828 4618 18062 | - - - | - - - | - - - | - - - | - - -
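These per-tile FLOP counts feed the roofline analysis used throughout the paper. As a hedged reminder of the standard roofline definitions (a sketch of the usual relations, not a formula quoted from the paper): if a transform performs W floating point operations and moves Q bytes between memory and the cores, its arithmetic intensity is AI = W/Q, and its execution time is bounded below by the slower of the compute and memory terms. The AIs tabulated below in Table 8 are, by this definition, the Table 7 FLOP counts divided by the corresponding memory traffic.

\[
  \mathrm{AI} = \frac{W}{Q}, \qquad
  T \;\ge\; \max\!\left(\frac{W}{\pi},\ \frac{Q}{\beta}\right)
          \;=\; \frac{W}{\min\!\left(\pi,\ \mathrm{AI}\cdot\beta\right)},
\]

where \(\pi\) denotes the peak floating point throughput and \(\beta\) the available memory bandwidth of the system.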

[Figure 5 plot data: one panel per layer (VGG 1.2–5.1/5.2, AlexNet 2–5), each titled "Gauss–FFT vs Regular–FFT", plotting speedup on the y-axis (ticks 0.50–1.50) against an x-axis spanning 0–60. Only the panel titles, axis ranges, and the y-axis label (Speedup) are recoverable from the extracted text.]

Figure 5. Theoretical estimates of our model and empirical measurements for the speedup of the Regular– vs the Gauss–FFT method, for VGG and AlexNet.


Table 8. Arithmetic intensities (AIs) of Gauss–FFT transforms for a single tile or kernel (In = input transform, Ker = kernel transform, Out = output transform), for kernel sizes r = 2 through r = 7.

Method       r = 2: In Ker Out | r = 3: In Ker Out | r = 4: In Ker Out | r = 5: In Ker Out | r = 6: In Ker Out | r = 7: In Ker Out
G(2², r²)    0.58 0.61 0.68 | 0.42 0.44 0.50 | 0.96 1.05 1.03 | 0.78 0.85 0.76 | 1.41 1.52 1.10 | 0.76 0.83 0.64
G(3², r²)    0.42 0.38 0.54 | 0.96 1.02 1.09 | 0.78 0.75 0.83 | 1.41 1.52 1.46 | 0.76 0.81 0.76 | 1.59 1.70 1.46
G(4², r²)    0.96 0.72 1.12 | 0.78 0.71 0.86 | 1.41 1.50 1.50 | 0.76 0.77 0.85 | 1.59 1.69 1.52 | 1.16 1.21 1.23
G(5², r²)    0.78 0.66 0.89 | 1.41 1.14 1.53 | 0.76 0.74 0.91 | 1.59 1.51 1.58 | 1.16 1.15 1.26 | 2.22 2.39 2.31
G(6², r²)    1.41 0.77 1.53 | 0.76 0.65 0.93 | 1.59 1.30 1.60 | 1.16 1.12 1.29 | 2.22 2.37 2.35 | 0.98 1.01 1.18
G(7², r²)    0.76 0.55 0.93 | 1.59 1.13 1.60 | 1.16 1.09 1.31 | 2.22 1.98 2.37 | 0.98 1.03 1.19 | 2.18 2.27 2.32
G(8², r²)    1.59 0.82 1.59 | 1.16 1.01 1.32 | 2.22 1.58 2.37 | 0.98 1.00 1.17 | 2.18 2.24 2.33 | 1.61 1.61 1.74
G(9², r²)    1.16 0.73 1.32 | 2.22 1.18 2.36 | 0.98 0.90 1.16 | 2.18 2.22 2.40 | 1.61 1.58 1.75 | 1.55 1.65 1.67
G(10², r²)   2.22 0.80 2.33 | 0.98 0.86 1.15 | 2.18 2.07 2.40 | 1.61 1.55 1.76 | 1.55 1.60 1.67 | 1.15 1.14 1.32
G(11², r²)   0.98 0.64 1.14 | 2.18 1.77 2.38 | 1.61 1.35 1.76 | 1.55 1.50 1.74 | 1.15 1.09 1.33 | 2.70 2.82 2.99
G(12², r²)   2.18 0.96 2.36 | 1.61 1.13 1.75 | 1.55 1.38 1.74 | 1.15 1.03 1.33 | 2.70 2.67 2.99 | 1.85 1.75 1.78
G(13², r²)   1.61 0.76 1.75 | 1.55 1.20 1.73 | 1.15 0.98 1.34 | 2.70 2.54 2.97 | 1.85 1.71 1.79 | 3.97 3.78 3.97
G(14², r²)   1.55 0.83 1.72 | 1.15 0.82 1.33 | 2.70 2.44 2.96 | 1.85 1.46 1.80 | 3.97 3.67 3.96 | 1.38 1.30 1.55
G(15², r²)   1.15 0.66 1.33 | 2.70 2.03 2.93 | 1.85 1.17 1.79 | 3.97 3.37 3.93 | 1.38 1.21 1.55 | 2.01 2.19 2.10
G(16², r²)   2.70 1.18 2.90 | 1.85 1.00 1.79 | 3.97 3.15 3.94 | 1.38 1.17 1.55 | 2.01 1.98 2.20 | 2.42 1.99 2.61
G(17², r²)   1.85 0.82 1.77 | 3.97 3.02 3.91 | 1.38 1.11 1.55 | 2.01 1.88 2.19 | 2.42 1.78 2.60 | 5.06 4.74 5.43
G(18², r²)   3.97 2.28 3.86 | 1.38 1.02 1.55 | 2.01 1.59 2.18 | 2.42 1.60 2.60 | 5.06 4.67 5.40 | 1.34 1.20 1.54
G(19², r²)   1.38 0.75 1.54 | 2.01 1.22 2.17 | 2.42 1.36 2.58 | 5.06 3.77 5.36 | 1.34 1.18 1.54 | 2.79 2.05 2.61
G(20², r²)   2.01 0.84 2.16 | 2.42 1.15 2.57 | 5.06 3.54 5.31 | 1.34 1.11 1.52 | 2.79 1.90 2.60 | 2.38 2.10 2.60
G(21², r²)   2.42 0.79 2.55 | 5.06 3.30 5.26 | 1.34 0.99 1.52 | 2.79 1.76 2.59 | 2.38 2.06 2.59 | 2.92 2.06 2.64
G(22², r²)   5.06 2.84 5.27 | 1.34 0.88 1.50 | 2.79 1.51 2.58 | 2.38 1.90 2.59 | 2.92 1.93 2.63 | 1.83 1.62 2.01
G(23², r²)   1.34 0.69 1.49 | 2.79 1.28 2.56 | 2.38 1.78 2.57 | 2.92 1.68 2.62 | 1.83 1.57 2.00 | 4.15 3.76 4.46
G(24², r²)   2.79 0.85 2.54 | 2.38 1.72 2.56 | 2.92 1.43 2.60 | 1.83 1.51 2.00 | 4.15 3.67 4.44 | 1.75 1.53 1.95
G(25², r²)   2.38 0.90 2.54 | 2.92 1.22 2.59 | 1.83 1.41 1.99 | 4.15 3.58 4.41 | 1.75 1.44 1.95 | 4.02 3.48 4.36
G(26², r²)   2.92 0.87 2.57 | 1.83 1.11 1.98 | 4.15 3.25 4.38 | 1.75 1.37 1.94 | 4.02 3.46 4.33 | 1.58 1.38 1.78
G(27², r²)   1.83 0.78 1.97 | 4.15 3.11 4.34 | 1.75 1.14 1.93 | 4.02 3.34 4.31 | 1.58 1.29 1.77 | - - -
G(28², r²)   4.15 2.47 4.34 | 1.75 0.97 1.92 | 4.02 3.24 4.28 | 1.58 1.20 1.76 | - - - | - - -
G(29², r²)   1.75 0.80 1.91 | 4.02 2.71 4.24 | 1.58 1.11 1.75 | - - - | - - - | - - -
G(30², r²)   4.02 2.16 4.22 | 1.58 0.90 1.75 | - - - | - - - | - - - | - - -
G(31², r²)   1.58 0.71 1.74 | - - - | - - - | - - - | - - - | - - -
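To make the connection between the two tables concrete, the following is a minimal sketch (not code from the paper) of how a roofline-style time estimate for a single transform can be derived from a Table 7 FLOP count and the matching Table 8 AI; the peak-throughput and bandwidth constants are illustrative placeholders, not measured values.

# Hedged sketch (not from the paper): roofline time estimate for one Gauss-FFT
# transform, combining a FLOP count from Table 7 with its AI from Table 8.
# PEAK_GFLOPS and BANDWIDTH_GBS are illustrative placeholders, not measurements.
PEAK_GFLOPS = 2112.0      # assumed peak single-precision throughput, GFLOP/s
BANDWIDTH_GBS = 96.0      # assumed main-memory bandwidth, GB/s

def roofline_time_ns(flops, ai):
    """Lower bound on time (ns): the slower of the compute and memory terms."""
    bytes_moved = flops / ai                  # AI = FLOPs / bytes moved
    compute_ns = flops / PEAK_GFLOPS          # FLOPs / (GFLOP/s) is numerically ns
    memory_ns = bytes_moved / BANDWIDTH_GBS   # bytes / (GB/s) is numerically ns
    return max(compute_ns, memory_ns)

# Input transform of G(2^2, 2^2): 63 FLOPs (Table 7) at AI 0.58 (Table 8).
print(round(roofline_time_ns(63, 0.58), 2), "ns")

With these placeholder numbers the memory term dominates by a wide margin, which is the expected behaviour: the AIs in Table 8 (roughly 0.4 to 5.4 FLOPs per byte) sit well below the compute-to-memory ratios of the benchmarked systems, so the transform stages are bandwidth-bound rather than compute-bound.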
