

Coordinating Filters for Faster Deep Neural Networks

Wei Wen, University of Pittsburgh

Cong Xu, Hewlett Packard Labs

Chunpeng Wu, University of Pittsburgh

Yandan Wang, University of Pittsburgh

Yiran Chen, Duke University

Hai Li, Duke University

Abstract

Very large-scale Deep Neural Networks (DNNs) have achieved remarkable successes in a large variety of computer vision tasks. However, the high computation intensity of DNNs makes it challenging to deploy these models on resource-limited systems. Some studies use low-rank approaches that approximate the filters by a low-rank basis to accelerate testing. Those works directly decompose pre-trained DNNs by Low-Rank Approximation (LRA). How to train DNNs toward a lower-rank space for more efficient DNNs, however, remains an open problem. To address this issue, we propose Force Regularization, which uses attractive forces to enforce filters so as to coordinate more weight information into a lower-rank space¹. We mathematically and empirically verify that, after applying our technique, standard LRA methods can reconstruct the filters using a much lower-rank basis and thus produce faster DNNs. The effectiveness of our approach is comprehensively evaluated on ResNets, AlexNet, and GoogLeNet. On AlexNet, for example, Force Regularization gains a 2× speedup on a modern GPU without accuracy loss and a 4.05× speedup on a CPU at the cost of a small accuracy degradation. Moreover, Force Regularization better initializes the low-rank DNNs, so that the fine-tuning converges faster toward higher accuracy. The obtained lower-rank DNNs can be further sparsified, proving that Force Regularization can be integrated with state-of-the-art sparsity-based acceleration methods.

1. Introduction

Deep Neural Networks (DNNs) have achieved record-breaking accuracy in many image classification tasks [16][24][25][10]. With the advances of algorithms, the availability of databases, and improvements in hardware performance, the depth of DNNs has grown dramatically from a few to hundreds or even thousands of layers, enabling human-level performance [9].

¹ The source code is available at https://github.com/wenwei202/caffe

Figure 1. The low-rank basis of filters in the first layer of the convolutional neural network [16] on CIFAR-10. The low-rank basis is formed by the most significant principal filters obtained by PCA. Top: the low-rank basis of the original network. Bottom: the low-rank basis of the same network after applying Force Regularization. The number of red boxes indicates the required rank to reconstruct the original filters with ≤ 20% error.

However, deploying these large models on resource-limited platforms, e.g., mobiles and autonomous cars, is very challenging due to their high demand for computation resources and hence energy consumption.

Recently, many techniques to accelerate the testing process of deployed DNNs have been studied, such as weight sparsification or connection pruning [8][7][28][23][22][6][19]. These approaches require delicate hardware customization and/or software design to translate sparsity into practical speedup. Unlike sparsity-based methods, Low-Rank Approximation (LRA) methods [22][4][5][12][11][26][27][18][30][14] directly decompose an original large model into a compact model with more lightweight layers. Thanks to the redundancy (correlation) among filters in DNNs, the original weight tensors can be approximated by a very low-rank basis. From the viewpoint of matrix computation, LRA approximates a large weight matrix by the product of two or more small ones to reduce computation complexity.

Previous LRA methods mostly focus on how to decompose the pre-trained weight tensors to maximize the reduction of computation complexity while retaining classification accuracy. Instead, we propose to nudge the weights with additional gradients (attractive forces) that coordinate the filters to a more correlated state.


Our approach aims to improve the correlation among filters and thereby obtain more lightweight DNNs through LRA. To the best of our knowledge, this is the first work to train DNNs toward a lower-rank space such that LRA can achieve faster DNNs.

The motivation of this work is fundamental. It has been proven that trained filters are highly clustered and correlated [5][4][12]. Suppose each filter is reshaped as a vector; a cluster of highly correlated vectors then has small included angles. If we can coordinate these vectors toward a state with smaller included angles, the correlation of the filters within that cluster improves. Consequently, LRA can produce a DNN with lower ranks and higher computation efficiency.

We propose Force Regularization to coordinate filters in DNNs. As demonstrated in Fig. 1, when using the same LRA method, say, cross-filter Principal Component Analysis (PCA) [30], applying Force Regularization greatly reduces the required rank compared with the original design (i.e., 5 vs. 11), while keeping the same approximation error (≤ 20%). As we shall show in Section 5, applying Force Regularization in the training of state-of-the-art DNNs successfully obtains lower-rank DNNs and thus improves computation efficiency, e.g., a 4.05× speedup for AlexNet with small accuracy loss.

The contributions of our work include: (1) We propose an effective and easy-to-implement Force Regularization to train DNNs for lower-rank approximation. To the best of our knowledge, this is the first work to manipulate the correlation among filters during training such that LRA can achieve faster DNNs; (2) DNNs manipulated by Force Regularization provide better initialization for the retraining of LRA-decomposed DNNs, resulting in faster convergence to better accuracy; (3) Lightweight DNNs that have been aggressively compressed by our method can be further sparsified, that is, our method can be integrated with state-of-the-art sparsity-based methods to potentially achieve faster computation; (4) Force Regularization can be easily generalized to Discrimination Regularization, which learns more discriminative filters to improve classification accuracy; (5) Our implementation is open-source for both CPUs and GPUs.

2. Related work

Low-rank approximation. LRA methods decompose a large model into a compact one with more lightweight layers by weight/tensor factorization. Denil et al. [4] studied different dictionaries to remove the redundancy between filters and channels in DNNs. Jaderberg et al. [12] explored filter and data reconstruction optimizations to attain optimal separable bases. Denton et al. [5] clustered filters, extended LRA (e.g., Singular Value Decomposition, SVD) to larger-scale DNNs, and achieved a 2× speedup for the first two layers with 1% accuracy loss.

Many new decomposition methods were proposed [11][26][18][30], and the effectiveness of LRA in state-of-the-art DNNs was evaluated [24][25]. Similar evaluations on mobile devices were also reported [14][27]. Unlike them, we propose Force Regularization to coordinate DNN filters into more correlated states, in which lower-rank or more compact DNNs are achievable for faster computation.

Sparse deep neural networks. Studies on sparse DNNs can be categorized into two types: non-structured [20][23][22][8][6] and structured [28][21][19][1] sparsity methods. The first category prunes each connection independently, so the sparse weights are randomly distributed. The level of non-structured sparsity is usually insufficient to achieve good practical speedup on modern hardware [28][19]. Software optimization [23][22] and hardware customization [7] have been proposed to overcome this issue. Conversely, the structured approaches prune connections group by group, such that the sparsified DNNs have a regular distribution of sparse weights. This regularity is friendly to modern hardware for acceleration. Our work is orthogonal to sparsity-based methods. More importantly, we find that DNNs accelerated by our method can be further sparsified by both non-structured and structured sparsity methods, potentially achieving faster computation.

3. Correlated Filters and Their Approximation

The prior knowledge is that correlation exists among trained filters in DNNs and those filters lie in a low-rank space. For example, the color-agnostic filters [16] learned in the first layer of AlexNet lie in a hyper-plane where the RGB channels at each pixel have the same value. Fig. 2 presents the results of Linear Discriminant Analysis (LDA) of the first convolutional filters in AlexNet and GoogLeNet. The filters are normalized to unit vectors, colored into four clusters by k-means clustering, and then projected to a 2D space by LDA to maximize cluster separation. The figure indicates high correlation among filters within a cluster. A naïve approach to filter approximation is to use the centroid of a cluster to approximate the filters within that cluster; the number of clusters is then the rank of the space. Essentially, k-means clustering is an LRA [2] method, although we will later show that other LRA methods can give better approximation.

Figure 2. Linear Discriminant Analysis (LDA) of filters in the first convolutional layer of AlexNet (left) and GoogLeNet (right).

Figure 3. Cross-filter LRA of a convolutional layer: N filters of size C×H×W are decomposed into M basis filters (M ≪ N) followed by a 1×1 combination layer.

The motivation of this work is that if we can nudge the filters during training so that the filters within a cluster are coordinated closer together, and some adjacent clusters even merge into one, then a more accurate filter approximation using a lower rank can be achieved. We propose Force Regularization to realize this.

Before introducing Force Regularization, we first mathematically formulate the LRA of DNN filters. Theoretically, almost all LRA methods can obtain a lower-rank approximation on top of our method, because the filters are coordinated to a more correlated state. Instead of onerously replicating all of these LRA methods, we choose the cross-filter approximation [4][30] and the state-of-the-art work in [26] as our baselines.

Fig. 3 illustrates the cross-filter approximation of a convolutional layer. We assume all weights in a convolutional layer form a tensor W ∈ R^{N×C×H×W}, where N and C are the numbers of filters and input channels, and H and W are the spatial height and width of the filters, respectively. With input feature map I, the n-th output feature map is O_n = W_n * I, where W_n ∈ R^{1×C×H×W} is the n-th filter. Because of the redundancy (or correlation) across the filters [4], the tensor W_n (∀n ∈ [1...N]) can be approximated by a linear combination of the basis B_m ∈ R^{1×C×H×W} (m ∈ [1...M], M ≪ N) of a low-rank space B ∈ R^{M×C×H×W}, such that

$$
\mathcal{O}_n \approx \Big(\sum_{m=1}^{M} b^{(n)}_m \mathcal{B}_m\Big) * \mathcal{I} = \sum_{m=1}^{M} b^{(n)}_m \mathcal{F}_m. \qquad (1)
$$

where b^{(n)}_m is a scalar and F_m = B_m * I is the feature map generated by the basis filter B_m. The output feature map O_n is therefore a linear combination of F_m (m ∈ [1...M]), which can be interpreted as a feature-map basis. Since the linear combination is essentially a 1×1 convolution, the convolutional layer can be decomposed into two sequential lightweight convolutional layers, as shown in Fig. 3. The original computation complexity is O(NCHWH'W'), where H' and W' are the height and width of the output feature maps, respectively. After applying cross-filter LRA, the computation complexity is reduced to O(MCHWH'W' + NMH'W'). The computation complexity therefore decreases when the rank M < NCHW/(CHW + N).

Figure 4. Force Regularization to coordinate filters.
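To make the complexity argument concrete, the short Python sketch below (our own illustration, not part of the paper; function names are made up) counts multiply-accumulate operations before and after the decomposition of Fig. 3 and reports the theoretical speedup. The example call assumes the standard AlexNet conv3 shape (384 filters, 256 input channels, 3×3 kernels, 13×13 output maps); with the ranks that appear later in Table 4, it reproduces the theoretical factors reported there.

```python
def conv_cost(n_filters, channels, kh, kw, out_h, out_w):
    """Multiply-accumulate count of a standard convolutional layer."""
    return n_filters * channels * kh * kw * out_h * out_w

def lra_speedup(n_filters, channels, kh, kw, out_h, out_w, rank):
    """Theoretical speedup of the cross-filter LRA in Fig. 3: a rank-M basis
    convolution followed by a 1x1 combination layer."""
    full = conv_cost(n_filters, channels, kh, kw, out_h, out_w)      # O(NCHWH'W')
    basis = conv_cost(rank, channels, kh, kw, out_h, out_w)          # O(MCHWH'W')
    combine = conv_cost(n_filters, rank, 1, 1, out_h, out_w)         # O(NMH'W')
    return full / (basis + combine)

# Assumed AlexNet conv3 shape: 384 filters, 256 channels, 3x3 kernels, 13x13 outputs.
print(lra_speedup(384, 256, 3, 3, 13, 13, 184))   # ~1.79x (rank without Force Regularization)
print(lra_speedup(384, 256, 3, 3, 13, 13, 124))   # ~2.65x (rank after Force Regularization)
```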

4. Force Regularization

4.1. Regularization by Attractive Forces

This section proposes Force Regularization from the perspective of physics. It is a gradient-based approach that adds extra gradients to the data-loss gradients. The data-loss gradients aim to minimize classification error, as in traditional DNN training. The extra gradients introduced by Force Regularization gently adjust the lengths and directions of the data-loss gradients so as to nudge filters to a more correlated state. With a proper hyper-parameter setting, our method can coordinate more useful information of the filters into a lower-rank space while maintaining accuracy. Inspired by Newton's laws, we propose an intuitive, computation-efficient and effective Force Regularization that uses attractive forces to coordinate filters.

Force Regularization: As illustrated in Fig. 4, suppose the filter W_n ∈ W is reshaped as a vector W_n ∈ R^{1×CHW} and normalized as w_n ∈ R^{1×CHW} (∀n ∈ [1...N]), with their origin at O. We introduce the pair-wise attractive force f_ji = f(w_j − w_i) (∀i, j ∈ [1...N]) on w_i generated by w_j. The gradient of Force Regularization used to update filter W_i is defined as

$$
\Delta W_i = \sum_{j=1}^{N} \Delta W_{ij} = \|W_i\| \sum_{j=1}^{N}\left(f_{ji} - f_{ji}\, w_i^{T} w_i\right), \qquad (2)
$$

where ‖·‖ is the Euclidean norm. The regularization gradient in Eq. (2) is perpendicular to the filter vector and can be efficiently computed by additions and multiplications. The final update of the weights by gradient descent is
$$
W_i \leftarrow W_i - \eta \cdot \left(\frac{\partial E(\mathcal{W})}{\partial W_i} - \lambda_s \cdot \Delta W_i\right), \qquad (3)
$$
where E(W) is the data loss, η is the learning rate, and λ_s > 0 is the coefficient of Force Regularization that trades off rank against accuracy. We select λ_s by cross-validation in this work. The gradient of common weight-wise regularization (e.g., the ℓ2-norm) is omitted in Eq. (3) for simplicity.

Page 4: Coordinating Filters for Faster Deep Neural NetworksCoordinating Filters for Faster Deep Neural Networks Wei Wen University of Pittsburgh wew57@pitt.edu Cong Xu Hewlett Packard Labs

Fig. 4 intuitively explains our method. Suppose each vector w_i is a rigid stick with a particle fixed at its endpoint. The particle has unit mass, and the stick is massless and can freely spin around the origin. Given the pair-wise attractive forces (e.g., universal gravitation) f_ji, Eq. (2) is the acceleration of particle i. As the forces are attractive, neighboring particles tend to spin around the origin and assemble together. Although our regularizer might seem to collapse all particles to one point, which corresponds to the rank-one space of the most lightweight DNNs, the gradients of the data loss prevent this. More specifically, pre-trained filters orient toward discriminative directions w_n (n ∈ [1...N]). In each direction w_n, there are some correlated filters, as observed in Fig. 2. During the subsequent retraining with our regularizer, the regularization gradients coordinate a cluster of filters closer to a typical direction d_m (m ∈ [1...M], M ≪ N), but the data-loss gradients prevent the d_m from collapsing together, so as to maintain the filters' capability of extracting discriminative features. If all filters could be extremely collapsed toward one point while maintaining classification accuracy, it would imply that the filters are over-redundant and we could attain a very efficient DNN by decomposing it to a rank-one space.

We derive the Force Regularization gradient from the normalized filters based on the following facts: (1) a normalized filter lies on the unit hypersphere, and its orientation is the only free parameter we need to optimize; (2) the gradient of W_i can easily be scaled by the vector length ‖W_i‖ without changing the angular velocity.

In Eq. (2), f_ji = f(w_j − w_i) is a force function of the distance. In this work we study the ℓ2-norm force
$$
f_{\ell_2}(w_j - w_i) = w_j - w_i \qquad (4)
$$
and the ℓ1-norm force
$$
f_{\ell_1}(w_j - w_i) = \frac{w_j - w_i}{\|w_j - w_i\|}. \qquad (5)
$$
We call the force in Eq. (4) the ℓ2-norm force because its strength scales linearly with the distance ‖w_j − w_i‖, just as the gradient of an ℓ2-norm regularizer does. We call the force in Eq. (5) the ℓ1-norm force because it is a constant unit vector regardless of the distance, just as the gradient of an ℓ1-norm sparsity regularizer is.
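The update is simple to implement. Below is a minimal NumPy sketch of Eqs. (2)-(5) (our own illustration, not the authors' Caffe code; all names are made up). It also checks the property noted above that the regularization gradient is perpendicular to each filter vector.

```python
import numpy as np

def force_gradients(W, force="l2", eps=1e-12):
    """Force Regularization gradients of Eq. (2) for filters W of shape (N, C*H*W),
    one reshaped filter W_i per row."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)        # ||W_i||
    w = W / (norms + eps)                                    # normalized filters w_i
    diff = w[None, :, :] - w[:, None, :]                     # diff[i, j] = w_j - w_i
    if force == "l2":                                        # Eq. (4): strength ~ distance
        f = diff
    elif force == "l1":                                      # Eq. (5): constant unit strength
        f = diff / (np.linalg.norm(diff, axis=2, keepdims=True) + eps)
    else:
        raise ValueError(force)
    proj = np.einsum("ijk,ik->ij", f, w)                     # f_ji . w_i
    tangent = f - proj[:, :, None] * w[:, None, :]           # f_ji - (f_ji . w_i) w_i
    return norms * tangent.sum(axis=1)                       # Delta W_i of Eq. (2)

def force_update(W, data_grad, lr, lam, force="l2"):
    """One gradient-descent step of Eq. (3)."""
    return W - lr * (data_grad - lam * force_gradients(W, force))

# Tiny usage example: 4 random filters of size 3*3*3.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 27))
dW = force_gradients(W, force="l2")
print(np.abs(np.einsum("ij,ij->i", dW, W)).max())            # ~0: Delta W_i is orthogonal to W_i
```

Passing a negative coefficient lam turns the attraction into repulsion, which corresponds to the Discrimination Regularization variant described in Section 5.5.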

4.2. Mathematical Implications

This section explains the mathematical implications behind Force Regularization: it is related to, but different from, minimizing the sum of pair-wise distances between normalized filters.

Table 1. Ranks vs. scalers of the step sizes of the regularization gradients.

  Scaler          Error    conv1*   conv2   conv3
  0 (baseline)    18.0%    17/32    27/32   55/64
  ‖W_i‖           17.9%    15/32    22/32   30/64
  1/‖W_i‖         18.0%    16/32    27/32   32/64
  * The first convolutional layer.

Theorem 1. Suppose filter W_n ∈ W is reshaped as a vector W_n ∈ R^{1×CHW} and normalized as w_n ∈ R^{1×CHW} (∀n ∈ [1...N]). For each filter, Force Regularization under the ℓ2-norm force has the same gradient direction as the regularization R(W), but differs by adapting the step size to the filter's length, where
$$
\mathcal{R}(\mathcal{W}) = \frac{1}{2}\sum_{j=1}^{N}\sum_{i=1}^{N}\left\|\frac{W_j}{\|W_j\|} - \frac{W_i}{\|W_i\|}\right\|^2. \qquad (6)
$$

Proof: Because w_j = W_j / ‖W_j‖,
$$
\begin{aligned}
\frac{\partial \mathcal{R}(\mathcal{W})}{\partial W_i}
&= \frac{1}{2}\sum_{j=1}^{N}\frac{\partial\,(w_j - w_i)(w_j - w_i)^T}{\partial W_i}
 = \frac{1}{2}\sum_{j=1}^{N}\frac{\partial\,(1 - 2 w_j w_i^T + 1)}{\partial W_i} \\
&= -\sum_{j=1}^{N}\frac{\partial\,(w_j w_i^T)}{\partial W_i}
 = -\sum_{j=1}^{N} w_j \frac{\partial w_i^T}{\partial W_i},
\end{aligned} \qquad (7)
$$

where ∂w_i^T/∂W_i := G_i is a derivative matrix with elements
$$
G_i^{(pq)} = \frac{\partial w_i^{(p)}}{\partial W_i^{(q)}}
= \frac{\partial\,\big(W_i^{(p)} / \|W_i\|\big)}{\partial W_i^{(q)}}
= \frac{1}{\|W_i\|}\left(\delta(p,q) - \frac{W_i^{(p)} W_i^{(q)}}{\|W_i\|^2}\right). \qquad (8)
$$

Superscripts p, q ∈ [1...CHW] index the elements of the vectors w_i and W_i, and δ(p, q) is the unit impulse function:
$$
\delta(p,q) = \begin{cases} 1 & p = q \\ 0 & p \neq q. \end{cases} \qquad (9)
$$

Therefore,
$$
G_i = \frac{1}{\|W_i\|}\left(\mathbf{I} - w_i^T w_i\right). \qquad (10)
$$

Substituting Eq. (10) into Eq. (7), we have
$$
-\frac{\partial \mathcal{R}(\mathcal{W})}{\partial W_i}
= \frac{1}{\|W_i\|}\sum_{j=1}^{N}\Big((w_j - w_i) - (w_j - w_i)\, w_i^T w_i\Big)
= \frac{1}{\|W_i\|}\left(\sum_{j=1}^{N} f_{ji} - \Big(\sum_{j=1}^{N} f_{ji}\Big) w_i^T w_i\right), \qquad (11)
$$

Page 5: Coordinating Filters for Faster Deep Neural NetworksCoordinating Filters for Faster Deep Neural Networks Wei Wen University of Pittsburgh wew57@pitt.edu Cong Xu Hewlett Packard Labs

where f_ji = f_{ℓ2}(w_j − w_i) = w_j − w_i. Therefore, Eq. (11) and Eq. (2) have the same direction.

Theorem 1 states that our proposed Force Regularization in Eq. (2) is related to Eq. (11). However, the step size of the gradient in Eq. (2) is scaled by the length ‖W_i‖ of the filter instead of its reciprocal as in Eq. (11). This ensures that a filter spins by the same angle regardless of its length, and it avoids division by zero. Table 1 summarizes the ranks vs. step sizes for the ConvNet [16] trained on the CIFAR-10 database without data augmentation. The original ConvNet has 32, 32, and 64 filters in its three convolutional layers, respectively. The rank is the smallest number of basis filters (as in Fig. 3) obtained by PCA with ≤ 5% reconstruction error. Scaling by ‖W_i‖ therefore works better than its reciprocal when coordinating filters into a lower-rank space.
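Theorem 1 is easy to verify numerically. The sketch below (our own check, with made-up helper names) compares a central finite-difference gradient of R(W) in Eq. (6) against ΔW_i of Eq. (2) for one filter and confirms that the two point in the same direction; their magnitudes differ only by the length-dependent rescaling discussed above.

```python
import numpy as np

def pairwise_distance_reg(W):
    """R(W) of Eq. (6): half the sum of squared pairwise distances between
    the normalized filters (rows of W)."""
    w = W / np.linalg.norm(W, axis=1, keepdims=True)
    diff = w[None, :, :] - w[:, None, :]
    return 0.5 * np.sum(diff ** 2)

def eq2_gradient(W):
    """Delta W_i of Eq. (2) with the l2-norm force f_ji = w_j - w_i."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    w = W / norms
    f_sum = w.sum(axis=0, keepdims=True) - len(W) * w        # sum_j (w_j - w_i), per row i
    proj = np.einsum("ij,ij->i", f_sum, w)                   # component along w_i
    return norms * (f_sum - proj[:, None] * w)

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 12))
i, eps = 2, 1e-6

# Central finite differences of R(W) with respect to filter i.
num_grad = np.zeros(W.shape[1])
for q in range(W.shape[1]):
    Wp, Wm = W.copy(), W.copy()
    Wp[i, q] += eps
    Wm[i, q] -= eps
    num_grad[q] = (pairwise_distance_reg(Wp) - pairwise_distance_reg(Wm)) / (2 * eps)

dW_i = eq2_gradient(W)[i]
cosine = np.dot(-num_grad, dW_i) / (np.linalg.norm(num_grad) * np.linalg.norm(dW_i))
print(cosine)   # ~1.0: Eq. (2) follows the negative gradient direction of R(W)
```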

Following the same proof procedure, we can easily show that Force Regularization under the ℓ1-norm force has the same conclusion when
$$
\mathcal{R}(\mathcal{W}) = \sum_{j=1}^{N}\sum_{i=1}^{N}\left\|\frac{W_j}{\|W_j\|} - \frac{W_i}{\|W_i\|}\right\|. \qquad (12)
$$

5. Experiments

5.1. Implementation

Our experiments are performed in Caffe [13] using CIFAR-10 [15] and ILSVRC-2012 ImageNet [3]. Published models are adopted as the baselines: on CIFAR-10, we choose ConvNet without data augmentation [16] and ResNets-20 with data augmentation [10]; we adopt the same shortcut connections as [28] for ResNets-20. For ImageNet, we use the AlexNet and GoogLeNet models trained by Caffe and report accuracy using only the center crop of images.

Our experiments with Force Regularization show that, for the same maximum number of iterations, training from the baseline achieves a better tradeoff between accuracy and speedup than training from scratch, because the baseline offers a good initial point for both accuracy and filter correlation. During training with Force Regularization on CIFAR-10, we use the same base learning rate as the baseline; on ImageNet, 0.1× the base learning rate of the baseline is adopted.

Figure 5. The rank M in each convolutional layer of ResNets-20 and GoogLeNet (baseline vs. ℓ2-norm force; red bars overlap blue bars). The accuracy loss is 0.75% for ResNets-20 and 2.46% (top-5) for GoogLeNet.

5.2. Rank Analysis of Coordinated DNNs

In light of the various low-rank approximation methods, and without loss of generality, we first adopt Principal Component Analysis (PCA) [30][22] to evaluate the effectiveness of Force Regularization. Specifically, the filter tensor W can be reshaped to a matrix W ∈ R^{N×CHW}, whose rows are the reshaped filters W_n (∀n ∈ [1...N]). PCA minimizes the least-squares reconstruction error when projecting a column (R^N) of W to a low-rank space R^M (M ≪ N). The reconstruction error is e_M = Σ_{i=M+1}^{N} λ_i, where λ_i is the i-th largest eigenvalue of the covariance matrix WW^T/(CHW − 1). Under a constraint on the error percentage e_M/e_0 (e.g., e_M/e_0 ≤ 5%), a lower-rank approximation is obtained when the minimal rank M is smaller. In this section, unless explicitly stated otherwise, we define the rank M of a convolutional layer as the minimal M that yields ≤ 5% reconstruction error under PCA.
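This rank measurement takes only a few lines of NumPy. The sketch below (our own illustration; the helper name and the toy data are made up) follows the covariance WW^T/(CHW − 1) written above and returns the minimal M whose error ratio e_M/e_0 is within the given bound.

```python
import numpy as np

def pca_rank(W, err=0.05):
    """Minimal rank M with PCA reconstruction error ratio e_M / e_0 <= err,
    for filters W reshaped to shape (N, C*H*W)."""
    n, chw = W.shape
    cov = W @ W.T / (chw - 1)                         # covariance as written in the text
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1]      # eigenvalues, descending
    total = eig.sum()
    for m in range(n + 1):
        if eig[m:].sum() <= err * total:              # e_M / e_0 <= err
            return m
    return n

# Toy check: 64 filters built from 8 shared basis vectors measure as low-rank,
# while 64 independent random filters do not.
rng = np.random.default_rng(0)
basis = rng.standard_normal((8, 75))                  # 8 basis "filters" of size 3*5*5
W_corr = rng.standard_normal((64, 8)) @ basis         # highly correlated filters
W_rand = rng.standard_normal((64, 75))                # uncorrelated filters
print(pca_rank(W_corr), pca_rank(W_rand))             # small rank (<= 8) vs. much larger
```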

Table 2 summarizes the rank M in each layer of ConvNet and AlexNet without accuracy loss after Force Regularization. In the baselines, the learned filters in the front layers are intrinsically in a very low-rank space, but the rank M in deeper layers is high. This could explain why only speedups of the first two convolutional layers were reported in [5]. Fortunately, using either the ℓ2-norm or the ℓ1-norm force, our method can efficiently maintain the low rank M in the first two layers (e.g., conv1–conv2 in AlexNet) while significantly reducing the rank M of deeper layers (e.g., conv3–conv5 in AlexNet). On average, our method reduces the layer-wise rank ratio to ∼50%. The effectiveness of our method on deep layers is very important as the depth of modern DNNs grows dramatically [25][10].

Table 2. The rank M in each convolutional layer after Force Regularization.

  Net      Force              Top-1 error   conv1    conv2     conv3     conv4     conv5     Average rank ratio‡
  ConvNet  None (baseline)†   18.0%         17/32    27/32     55/64     –         –         74.48%
  ConvNet  ℓ2-norm            17.9%         15/32    22/32     30/64     –         –         54.17%
  ConvNet  ℓ1-norm            18.0%         17/32    25/32     20/64     –         –         54.17%
  AlexNet  None (baseline)    42.63%        47/96    164/256   306/384   318/384   220/256   72.29%
  AlexNet  ℓ2-norm            42.70%        49/96    143/256   128/384   122/384   161/256   46.98%
  AlexNet  ℓ1-norm            42.45%        49/96    155/256   157/384   108/384   178/256   50.03%
  † The baseline without Force Regularization.  ‡ M/N: low rank M over full rank N, defined as the rank ratio.


Figure 6. The rank ratio (with ≤ 5% PCA reconstruction error) in each layer vs. top-1 error for AlexNet. Horizontal dotted lines represent the rank ratios of the baseline, and the vertical dotted line is the error of the baseline. Solid (dashed) curves depict rank ratios of AlexNet after Force Regularization by the ℓ2-norm (ℓ1-norm) force. Each layer is denoted by a distinct color. The sensitivity of the hyper-parameter λ_s: along the direction from left to right, λ_s of the ℓ2-norm force changes from 1.2e-5 to 1.8e-5, 2.0e-5, 3.0e-5, and 3.5e-5; for the ℓ1-norm force, it changes from 1.5e-5 to 1.8e-5, 2.0e-5, and 2.5e-5.

Fig. 5 shows the rank M of ResNets-20 [10] and GoogLeNet [25] after Force Regularization, demonstrating the scalability of our method to deeper DNNs. With an acceptable accuracy loss, 5 layers in ResNets-20 and 6 layers in GoogLeNet are even coordinated to rank M = 1, which indicates that those Inception blocks in GoogLeNet or Residual blocks in ResNets have been over-parameterized and can be greatly simplified.

To study the trade-off between rank and accuracy, and the pros and cons of the ℓ2-norm and ℓ1-norm forces, we conducted comprehensive experiments on AlexNet. As shown in Fig. 6, with a mere 1.71% (1.80%) accuracy loss, the average rank ratio can be reduced to 28.59% (28.72%) using the ℓ2-norm (ℓ1-norm) force. Very impressively, the rank M of each group in conv4 can be reduced to one by the ℓ1-norm force. The results also show that the ℓ2-norm force is more effective than the ℓ1-norm force when the rank ratio is high (e.g., conv2 and conv5), while the ℓ1-norm force works better for layers whose potential rank ratios are low (e.g., conv3 and conv4). In general, the ℓ2-norm force better balances the ranks across all layers.

Because Force Regularization coordinates more useful weight information in a low-rank space, it essentially provides a better training initialization for the DNNs that are decomposed by LRA. Fig. 7 plots the training data loss and top-1 validation error of AlexNet, which is decomposed to the same ranks by PCA. The baseline is the original AlexNet and the other AlexNet is coordinated by Force Regularization.

Figure 7. Training data loss (a) and top-1 validation error (b) vs. iteration when fine-tuning AlexNet decomposed to the same ranks (baseline vs. Force Regularization).

The figure shows that the error sharply converges to a low level after a few iterations, indicating that LRA provides a very good initialization for the low-rank DNNs; training them from scratch incurs significant accuracy loss. More importantly, the DNN coordinated by Force Regularization converges faster to a lower error.

Besides PCA [22][30], we also evaluated the effectiveness of Force Regularization when integrating it with SVD [5][26] or k-means clustering [5][2]. Table 3 compares the accuracies of AlexNet decomposed by different LRA methods. All LRAs preserve the same ranks in all layers, which means the decomposed AlexNets have the same network structure. In summary, PCA and SVD obtain similar accuracy and surpass k-means clustering. Due to space limits, we adopt PCA as the representative method in our study.
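For intuition on why the centroid-based approximation trails PCA/SVD at equal rank, the toy comparison below (our own sketch on synthetic filters; helper names are made up) reconstructs a correlated filter matrix with a truncated SVD and with k-means centroids [2]. By the Eckart-Young theorem, the SVD error can never exceed the centroid error at the same rank, which is consistent with the accuracy ordering in Table 3.

```python
import numpy as np
from numpy.linalg import norm, svd

def svd_rank_m(W, m):
    """Best rank-m approximation of the filter matrix W (N x CHW) by truncated SVD."""
    U, s, Vt = svd(W, full_matrices=False)
    return (U[:, :m] * s[:m]) @ Vt[:m]

def kmeans_rank_m(W, m, iters=50, seed=0):
    """Approximate each filter by its cluster centroid; m centroids give a rank-m factorization."""
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), m, replace=False)]
    for _ in range(iters):
        labels = norm(W[:, None] - centers[None], axis=2).argmin(axis=1)
        for k in range(m):
            if np.any(labels == k):
                centers[k] = W[labels == k].mean(axis=0)
    labels = norm(W[:, None] - centers[None], axis=2).argmin(axis=1)
    return centers[labels]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 10)) @ rng.standard_normal((10, 75))   # rank-10 correlated filters
for m in (5, 10):
    print(m,
          norm(W - svd_rank_m(W, m)) / norm(W),      # always the smaller of the two errors
          norm(W - kmeans_rank_m(W, m)) / norm(W))
```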

5.3. Acceleration of DNN Testing

In our experiments, we first train DNNs with Force Regularization, then decompose them using LRA methods and fine-tune them to recover accuracy. In the evaluation of speed, we omit the small CIFAR-10 database and focus on large-scale DNNs on ImageNet, whose speed is a real concern. To demonstrate the effective acceleration of Force Regularization, we adopt the speedup of state-of-the-art LRAs [30][4][26] as our baseline. Our speedup is achieved in the case where the DNN filters are first coordinated by Force Regularization and then decomposed using the same LRAs. The practical GPU speed is profiled on advanced hardware (NVIDIA GTX 1080) and software (cuDNN 5.0).

Table 3. The accuracy of different LRA methods under the same ranks.

  Force     LRA        Top-1 error
  None      PCA        43.21%
  None      SVD†       43.27%
  None      k-means†   44.34%
  ℓ2-norm   PCA        43.25%
  ℓ2-norm   SVD†       43.20%
  ℓ2-norm   k-means†   44.80%
  † SVD and k-means preserve the same ranks as PCA.


Table 4. The higher speedups of AlexNet by Force Regularization.

  Force     Top-1 error                 conv3   conv4   conv5
  None      43.21%   rank               184     201     146
  ℓ2-norm   43.25%   rank               124     106     129
  None      43.21%   GPU speedup        1.58×   1.21×   1.15×
  ℓ2-norm   43.25%   GPU speedup        2.16×   2.03×   1.33×
  None      43.21%   CPU speedup        1.78×   1.60×   1.47×
  ℓ2-norm   43.25%   CPU speedup        2.45×   2.76×   1.64×
  None      43.21%   theoretical        1.79×   1.72×   1.63×
  ℓ2-norm   43.25%   theoretical        2.65×   3.26×   1.85×

The CPU speed is measured on an Intel Xeon E5-2630 using the ATLAS library. The batch size is 256.

Cross-filter LRA: We first evaluate the speedup of the cross-filter LRA shown in Fig. 3. In previous works [5][26], the optimal rank in each layer can be selected layer by layer using cross-validation. However, the number of hyper-parameters then increases linearly with the depth of the DNN. To save development time, we use an identical error percentage e_M/e_0 across all layers as the single hyper-parameter, although layer-wise rank selection may give a better tradeoff. The rank of a layer is then the minimal M whose reconstruction error ratio e_M/e_0 is below this threshold.

As mentioned in Section 5.2 and Table 2, the learned conv1 and conv2 of AlexNet are already in a very low-rank space and achieve good speedups using LRAs [5]. Thus we mainly focus on conv3–conv5 here. Table 4 summarizes the speedups of the PCA approximation of AlexNet with and without ℓ2-norm Force Regularization. With a negligible accuracy difference, Force Regularization successfully coordinates filters to a lower-rank space and accelerates testing by a higher factor compared with the state-of-the-art LRA. Similar results are observed when applying the ℓ1-norm force.

The results in Table 4 also show that the practical speedup differs from the theoretical speedup. Generally, the difference is smaller on lower-performance processors. In the CPU mode of Table 4, Force Regularization achieves a 2× speedup of the total convolution time.

Speeding up a state-of-the-art LRA: We also reproduce the state-of-the-art work [26] as a baseline² (lra1). After LRA, AlexNet is fine-tuned with a learning rate starting from 0.001 and divided by 10 at iterations 70,000 and 140,000. Fine-tuning terminates after 150,000 iterations.

The first row in Table 5 contains the results of the baseline [26], which do not scale well to the advanced "TITAN 1080 + cuDNN 5.0" setup in conv3–5. This is because 3 × 3 convolution is highly optimized in cuDNN 5.0, e.g., using Winograd's minimal filtering algorithms [17]. However, the baseline decomposes each 3 × 3 convolution into a pair of 3 × 1 and 1 × 3 convolutions, so that the optimized cuDNN is not fully exploited.

² The code is provided by the authors at https://github.com/chengtaipu/lowrankcnn/

Table 5. The higher speedup factors by Force Regularization.

  LRA        Force     Top-5 err.          conv3   conv4   conv5
  lra1 [26]  None      20.65%    GPU       0.86×   0.57×   0.40×
  lra2       None      19.93%    GPU       1.89×   1.57×   1.57×
  lra2       ℓ2-norm   20.14%    GPU       2.25×   2.03×   1.60×
  lra2       ℓ2-norm   21.68%    GPU       3.56×   3.01×   2.40×
                                 CPU       4.81×   4.00×   2.92×

This will be a common issue for the baseline, considering that Winograd's algorithm is universally used and 3 × 3 convolution is one of the most common structures. We find that the LRA in Fig. 3 can be utilized for conv3–5 to solve this issue, because it maintains the 3 × 3 shape. We name this LRA lra2; it decomposes conv1–conv2 using the LRA in [26] and conv3–5 using the LRA of Fig. 3. The second row in Table 5 shows that our lra2 scales well to the hardware and software advances of "TITAN 1080 + cuDNN 5.0". More importantly, Force Regularization on conv3–5 can enforce them to more lightweight layers and attain higher speedup factors than lra2 without it. The result is shown in the third row, which in total achieves a 2.03× speedup for the whole convolution on GPU. With the small accuracy loss in row 4 of Table 5, Force Regularization achieves a 2.50× speedup of the total convolution on GPU and 4.05× on CPU.

Table 6 compares our method with state-of-the-art DNN acceleration methods in CPU mode. When the speedup of the total time was not reported by the authors, we estimate it from the per-layer speedups, weighting each layer by its percentage of the running time. On our hardware platform, conv1–conv5 respectively consume 15.89%, 28.25%, 24.32%, 18.70% and 12.84% of the testing time. The estimation is accurate: for example, we estimate a 2.58× total-time speedup for one-shot [14], which is very close to the 2.52× reported by the authors. Compared with both cp-decomposition and one-shot, our method achieves higher accuracy and higher speedup. Compared with SSL, with almost the same top-5 error (21.68% vs. 21.63%), we attain a higher speedup of 4.05× vs. 3.13×.
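The total-time estimate described above can be reproduced in a few lines (our own sketch). It assumes the per-layer speedups are combined in the time domain, i.e., each layer's share of the running time is divided by its speedup; with the one-shot [14] numbers from Table 6 this yields the 2.58× figure quoted in the text.

```python
def total_speedup(time_shares, layer_speedups):
    """Whole-convolution speedup estimated from per-layer speedups, weighting
    each layer by its share of the original running time."""
    new_time = sum(share / s for share, s in zip(time_shares, layer_speedups))
    return sum(time_shares) / new_time

shares = [0.1589, 0.2825, 0.2432, 0.1870, 0.1284]   # conv1-conv5 running-time shares (CPU)
one_shot = [1.48, 2.30, 3.84, 3.53, 3.13]           # per-layer speedups of one-shot [14]
print(total_speedup(shares, one_shot))              # ~2.58x, matching the estimate in the text
```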

deep-compression [7] reported 3× to 4× speedups in fully-connected layers when the batch size was 1. However, convolution is the bottleneck of DNNs; e.g., the convolution time in AlexNet is 5× that of the fully-connected layers when profiled on our CPU platform. Moreover, no speedup was observed in the batching scenario, as reported by the authors [7]. More importantly, as we show in Section 5.4, our method can work together with sparsity-based methods (e.g., SSL or deep-compression) to obtain lower-rank and sparse DNNs and potentially further accelerate DNN testing.


Table 6. Comparison of speedup factors on AlexNet by state-of-the-art DNN acceleration methods.

  Method                 Top-5 err.         conv1   conv2   conv3   conv4   conv5   total
  AlexNet in Caffe       19.97%             1.00×   1.00×   1.00×   1.00×   1.00×   1.00×
  cp-decomposition [18]  20.97% (+1.00%)    –       4.00×   –       –       –       1.27×
  one-shot [14]          21.67% (+1.70%)    1.48×   2.30×   3.84×   3.53×   3.13×   2.52×
  SSL [28]               19.58% (-0.39%)    1.00×   1.27×   1.64×   1.68×   1.32×   1.35×
  SSL [28]               21.63% (+1.66%)    1.05×   3.37×   6.27×   9.73×   4.93×   3.13×
  our lra2               20.14% (+0.17%)    2.61×   6.06×   2.48×   2.20×   1.58×   2.69×
  our lra2               21.68% (+1.71%)    2.65×   6.22×   4.81×   4.00×   2.92×   4.05×

5.4. Lower-rank and Sparse DNNs

We sparsify the lightweight deep neural network (i.e., the first lra2 model in Table 6) using Structured Sparsity Learning (SSL) [28] or non-structured connection pruning [23]. Note that Guided Sparsity Learning (GSL) is not adopted in our connection pruning, although better sparsity is achievable when applying it. Figure 8 summarizes the results.

The experiments prove that our method can work together with both structured and non-structured sparsity methods to further compress and accelerate models. Compared with deep-compression in Figure 8(a), our model has comparable compression rates but 2.69× faster testing time. Typically, our model has higher compression rates in the convolutional layers, which provides more room for computation reduction and generalizes better to modern DNNs (for example, ResNets-152 [10], whose fc-layer parameters are only 4% of the total). In Figure 8(b), our accelerated model can be further accelerated using SSL.

Figure 8. Results of sparsifying lightweight DNNs whose filters are coordinated to a lower-rank space by Force Regularization: (a) compression rate of deep-compression vs. lra2 + connection-pruning; (b) percentage of structured sparsity of SSL vs. lra2 + SSL. For deep-compression in (a), we only count the compression rate obtained from connection pruning for a fair comparison, but quantization and Huffman coding can also be utilized to improve the compression rate of our model. Based on SSL in (b), we enforce shape-wise sparsity on conv3_s, conv4_s and conv5_s to learn the shapes of the basis filters, and enforce filter-wise sparsity on conv3_f and conv4_f to learn the number of filters [28]. As each convolutional layer in lra2 is decomposed into two small layers, we denote the first and second small layer by the suffixes "_s" and "_f", respectively. The baseline and our model have the same accuracy.

The shape-wise sparsity in conv3–5 of our model is slightly lower because our model is already aggressively compressed by LRA. The higher filter-wise sparsity, however, implies the orthogonality of our approach to SSL.

5.5. Generalization of Force Regularization

In convolutional layers, each filter basically extracts a discriminative feature, e.g., an orientation-selective pattern or a color blob in the first layer [16], or a high-level feature (e.g., textures, faces) in deeper layers [29]. The discrimination among filters is important for classification performance. Our method can coordinate filters for more lightweight DNNs while maintaining this discrimination. It can also be generalized to learn more discriminative filters and improve accuracy. The extension to Discrimination Regularization is straightforward but effective: the opposite gradient of Force Regularization (i.e., λ_s < 0) is used to update the filters. In this scenario, it acts as a repulsive force that pushes surrounding filters apart and enhances discrimination. Table 7 summarizes the improved accuracy of state-of-the-art DNNs.

Acknowledgments

This work was supported in part by NSF CCF-1744082. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or their contractors.

Table 7. Improved accuracy with Discrimination Regularization.

  Net         Regularization    Top-1 error
  AlexNet     None (baseline)   42.63%
  AlexNet     ℓ2-norm force     41.71%
  AlexNet     ℓ1-norm force     41.53%
  ResNets-20  None (baseline)   8.82%
  ResNets-20  ℓ2-norm force     7.97%
  ResNets-20  ℓ1-norm force     8.02%


References

[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 2262–2270, 2016.

[2] C. Bauckhage. k-means clustering is matrix factorization. arXiv:1512.07548, 2015.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[4] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NIPS), 2013.

[5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NIPS), 2014.

[6] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems (NIPS), 2016.

[7] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149, 2015.

[8] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision (ICCV), 2015.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[11] Y. Ioannou, D. P. Robertson, J. Shotton, R. Cipolla, and A. Criminisi. Training CNNs with low-rank filters for efficient image classification. arXiv:1511.06744, 2015.

[12] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014.

[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.

[14] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv:1511.06530, 2015.

[15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[17] A. Lavin. Fast algorithms for convolutional neural networks. arXiv:1509.09308, 2015.

[18] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv:1412.6553, 2014.

[19] V. Lebedev and V. Lempitsky. Fast ConvNets using group-wise brain damage. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[20] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), volume 2, pages 598–605, 1989.

[21] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient ConvNets. In International Conference on Learning Representations (ICLR), 2017.

[22] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[23] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey. Faster CNNs with direct sparse convolutions and guided pruning. In International Conference on Learning Representations (ICLR), 2017.

[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[26] C. Tai, T. Xiao, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. In International Conference on Learning Representations (ICLR), 2016.

[27] P. Wang and J. Cheng. Accelerating convolutional neural networks for mobile applications. In Proceedings of the 2016 ACM on Multimedia Conference, 2016.

[28] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[29] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.

[30] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, Oct 2016.