
ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions

Zechun Liu1,2⋆, Zhiqiang Shen2†, Marios Savvides2, and Kwang-Ting Cheng1

1 Hong Kong University of Science and Technology, 2 Carnegie Mellon University
[email protected], {zhiqians,marioss}@andrew.cmu.edu, [email protected]

Abstract. In this paper, we propose several ideas for enhancing a binary network to close its accuracy gap to real-valued networks without incurring any additional computational cost. We first construct a baseline network by modifying and binarizing a compact real-valued network with parameter-free shortcuts, bypassing all the intermediate convolutional layers including the downsampling layers. This baseline network strikes a good trade-off between accuracy and efficiency, achieving superior performance to most existing binary networks at approximately half of the computational cost. Through extensive experiments and analysis, we observed that the performance of binary networks is sensitive to activation distribution variations. Based on this important observation, we propose to generalize the traditional Sign and PReLU functions, denoted as RSign and RPReLU for the respective generalized functions, to enable explicit learning of the distribution reshape and shift at near-zero extra cost. Lastly, we adopt a distributional loss to further enforce the binary network to learn output distributions similar to those of a real-valued network. We show that after incorporating all these ideas, the proposed ReActNet outperforms all state-of-the-art methods by a large margin. Specifically, it outperforms Real-to-Binary Net and MeliusNet29 by 4.0% and 3.6% top-1 accuracy respectively, and also reduces the gap to its real-valued counterpart to within 3.0% top-1 accuracy on the ImageNet dataset. Code and models are available at: https://github.com/liuzechun/ReActNet.

    1 Introduction

The 1-bit convolutional neural network (1-bit CNN, also known as binary neural network) [7,30], of which both weights and activations are binary, has been recognized as one of the most promising neural network compression methods for deploying models onto resource-limited devices. It enjoys a 32× memory compression ratio, and up to 58× practical computational reduction on CPU, as demonstrated in [30]. Moreover, with its pure logical computation (i.e., XNOR operations between binary weights and binary activations), the 1-bit CNN is both highly energy-efficient for embedded devices [8,40] and possesses the potential of being directly deployed on next-generation memristor-based hardware [17].

⋆ Work done while visiting CMU. † Corresponding author.



Methods                       OPs (×10^8)   Top-1 Acc (%)
Real-to-Binary Baseline [3]   1.63          60.9
Our ReAct Baseline Net        0.87          61.1
XNOR-Net [30]                 1.67          51.2
Bi-RealNet [23]               1.63          56.4
Real-to-Binary [3]            1.65          65.4
MeliusNet29 [2]               2.14          65.8
Our ReActNet-A                0.87          69.4
MeliusNet59 [2]               5.32          70.7
Our ReActNet-C                2.14          71.4

Fig. 1. Computational cost vs. ImageNet accuracy. The proposed ReActNets significantly outperform other binary neural networks. In particular, ReActNet-C achieves the state-of-the-art result with 71.4% top-1 accuracy while being 2.5× more efficient than MeliusNet59. ReActNet-A exceeds Real-to-Binary Net and MeliusNet29 by 4.0% and 3.6% top-1 accuracy, respectively, with more than 1.9× computational reduction. Details are described in Section 5.2.

Despite these attractive characteristics of the 1-bit CNN, severe accuracy degradation prevents it from being broadly deployed. For example, a representative binary network, XNOR-Net [30], only achieves 51.2% accuracy on the ImageNet classification dataset, leaving a ∼18% accuracy gap from the real-valued ResNet-18. Some preeminent binary networks [8,37] show good performance on small datasets such as CIFAR10 and MNIST, but still encounter a severe accuracy drop when applied to a large dataset such as ImageNet.

In this study, our motivation is to further close the performance gap between binary neural networks and real-valued networks on challenging large-scale datasets. We start by designing a high-performance baseline network. Inspired by the recent advances in real-valued compact neural network design, we choose the MobileNetV1 [15] structure as our binarization backbone, which we believe is of greater practical value than binarizing non-compact models. Following the insights highlighted in [23], we adopt blocks with identity shortcuts which bypass 1-bit vanilla convolutions to replace the convolutions in MobileNetV1. Moreover, we propose to use a concatenation of two such blocks to handle the channel number mismatch in the downsampling layers, as shown in Fig. 2(a). This baseline network design not only helps avoid real-valued convolutions in shortcuts, which effectively reduces the computation to nearly half of that needed in prevalent binary neural networks [30,23,3], but also achieves a high top-1 accuracy of 61.1% on ImageNet.

To further enhance the accuracy, we investigate another aspect which has not been studied in previous binarization or quantization works: activation distribution reshaping and shifting via non-linearity function design. We observed that the overall activation value distribution affects the feature representation, and this effect will be exaggerated by the activation binarization. A small distribution value shift near zero will cause the binarized feature map to have a disparate appearance and in turn will influence the final accuracy. This observation will be elaborated in Section 4.2. Enlightened by this observation, we propose a new


generalization of the Sign function and the PReLU function to explicitly shift and reshape the activation distribution, denoted as ReAct-Sign (RSign) and ReAct-PReLU (RPReLU) respectively. These activation functions adaptively learn the parameters for distributional reshaping, which enhances the accuracy of the baseline network by ∼7% with negligible extra computational cost.

Furthermore, we propose a distributional loss to enforce the output distribution similarity between the binary and real-valued networks, which further boosts the accuracy by ∼1%. After integrating all these ideas, the proposed network, dubbed ReActNet, achieves 69.4% top-1 accuracy on ImageNet with only 87M OPs, surpassing all previously published works on binary networks and reducing the accuracy gap from its real-valued counterpart to only 3.0%, as shown in Fig. 1.

We summarize our contributions as follows: (1) We design a baseline binary network by modifying MobileNetV1, whose performance already surpasses most of the previously published work on binary networks while incurring only half of the computational cost. (2) We propose a simple channel-wise reshaping and shifting operation on the activation distribution, which helps binary convolutions spare the computational power of adjusting the distribution to learn more representative features. (3) We further adopt a distributional loss between binary and real-valued network outputs, replacing the original loss, which facilitates the binary network to mimic the distribution of a real-valued network. (4) We demonstrate that our proposed ReActNet achieves 69.4% top-1 accuracy on ImageNet, for the first time exceeding the benchmarking ResNet-level accuracy (69.3%). This result also outperforms the state-of-the-art binary network [3] by 4.0% top-1 accuracy while incurring only half the OPs¹.

    2 Related Work

There have been extensive studies on neural network compression and acceleration, including quantization [46,39,43], pruning [9,12,24,22], knowledge distillation [14,33,6] and compact network design [15,32,25,41]. A comprehensive survey can be found in [35]. The proposed method falls into the category of quantization, specifically the extreme case of quantizing both weights and activations to only 1 bit, which is so-called network binarization or 1-bit CNNs.

Neural network binarization originates from EBP [34] and BNN [7], which establish an end-to-end gradient back-propagation framework for training the discrete binary weights and activations. As an initial attempt, BNN [7] demonstrated its success on small classification datasets including CIFAR10 [16] and MNIST [27], but encountered a severe accuracy drop on a larger dataset such as ImageNet [31], only achieving 42.2% top-1 accuracy compared to 69.3% for the real-valued version of ResNet-18.

Many follow-up studies focused on enhancing the accuracy. XNOR-Net [30], which proposed real-valued scaling factors to multiply with each of the binary weight kernels, has become a representative binarization method and enhanced the top-1 accuracy to 51.2%, narrowing the gap to the real-valued ResNet-18 to ∼18%.

¹ OPs is the sum of binary OPs and floating-point OPs, i.e., OPs = BOPs/64 + FLOPs.


Based on the XNOR-Net design, Bi-Real Net [23] proposed to add shortcuts to propagate real values along the feature maps, which further boosts the top-1 accuracy to 56.4%.

Several recent studies attempted to improve the binary network performance via expanding the channel width [26], increasing the network depth [21] or using multiple binary weight bases [19]. Despite improvement to the final accuracy, the additional computational cost offsets the BNN's high compression advantage.

For network compression, the real-valued network design used as the starting point for binarization should be compact. Therefore, we chose MobileNetV1 as the backbone network for developing our baseline binary network, which, combined with several improvements in implementation, achieves ∼2× further reduction in the computational cost compared to XNOR-Net and Bi-Real Net, and a top-1 accuracy of 61.1%, as shown in Fig. 1.

In addition to architectural design [2,23,28], studies on 1-bit CNNs range from training algorithms [36,1,46,3], binary optimizer design [13], and regularization loss design [8,29], to better approximation of binary weights and activations [30,11,37]. Different from these studies, this paper focuses on a new aspect that has seldom been investigated before but is surprisingly crucial for 1-bit CNNs' accuracy, i.e., activation distribution reshaping and shifting. For this aspect, we propose novel ReAct operations, which are further combined with a proposed distributional loss. These enhancements improve the accuracy to 69.4%, further shrinking the accuracy gap to its real-valued counterpart to only 3.0%. The baseline network design and ReAct operations, as well as the proposed loss function, are detailed in Section 4.

    3 Revisit: 1-bit Convolution

In a 1-bit convolutional layer, both weights and activations are binarized to -1 and +1, such that the computationally heavy operations of floating-point matrix multiplication can be replaced by lightweight bitwise XNOR operations and popcount operations [4], as:

X_b \ast W_b = \mathrm{popcount}(\mathrm{XNOR}(X_b, W_b)), \quad (1)

where W_b and X_b indicate the matrices of binary weights and binary activations. Specifically, weights and activations are binarized through a sign function:

x_b = \mathrm{Sign}(x_r) = \begin{cases} +1, & \text{if } x_r > 0 \\ -1, & \text{if } x_r \le 0 \end{cases}, \qquad
w_b = \begin{cases} +\frac{\|W_r\|_{\ell 1}}{n}, & \text{if } w_r > 0 \\ -\frac{\|W_r\|_{\ell 1}}{n}, & \text{if } w_r \le 0 \end{cases} \quad (2)

    The subscripts b and r denote binary and real-valued, respectively. The weight

binarization method is inherited from [30], in which \|W_r\|_{\ell 1}/n is the average of absolute weight values, used as a scaling factor to minimize the difference between binary and real-valued weights. XNOR-Net [30] also applied a similar real-valued scaling factor to binary activations. Note that with the introduction of the proposed ReAct operations, to be described in Section 4.2, this scaling factor for activations becomes unnecessary and can be eliminated.
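To make Eqs. (1)-(2) concrete, the following sketch (ours, not the authors' released code) emulates the binarization in floating point and checks on a pair of random vectors that the ±1 dot product matches an XNOR-and-popcount of the bit-encoded operands via dot = 2·popcount − n; all shapes are illustrative assumptions.

```python
import torch

def sign_binarize(x):
    """Binarize to {-1, +1} as in Eq. (2); zeros are mapped to -1."""
    return torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))

def binarize_weights(w):
    """XNOR-Net-style weight binarization (Eq. 2): the sign of each weight,
    scaled per output kernel by the mean absolute value ||W_r||_l1 / n."""
    scale = w.abs().mean(dim=(1, 2, 3), keepdim=True)
    return torch.where(w > 0, scale, -scale)

wb4d = binarize_weights(torch.randn(8, 16, 3, 3))   # one scaling factor per output kernel

# Random stand-ins for an activation patch and a flattened weight kernel.
xr, wr = torch.randn(64), torch.randn(64)
xb, wb = sign_binarize(xr), sign_binarize(wr)

dot = torch.dot(xb, wb)                      # float emulation of the 1-bit dot product

# Bit-level view of Eq. (1): encode -1 as 0 and +1 as 1, XNOR marks the agreeing
# positions, and the popcount of agreements recovers the same dot product.
agree = (~((xb > 0) ^ (wb > 0))).sum()       # popcount(XNOR(xb, wb))
assert int(dot.item()) == 2 * int(agree) - xb.numel()
```

The /64 in the OPs accounting used later in Section 5.1 reflects that 64 such bit operations can be packed into a single word-level instruction.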


(Fig. 2 diagram. (a) Proposed baseline network block — Normal Block: Sign → 1-bit 3×3 Conv → BatchNorm followed by Sign → 1-bit 1×1 Conv → BatchNorm, each with an identity shortcut; Reduction Block: Sign → 1-bit 3×3 Conv (s=2) → BatchNorm with a 2×2 AvgPool (s=2) shortcut, then a duplicated activation feeding two Sign → 1-bit 1×1 Conv → BatchNorm branches whose outputs are concatenated. (b) Proposed ReActNet block: the same normal and reduction blocks with ReAct Sign in place of Sign and ReAct PReLU inserted after each BatchNorm.)

Fig. 2. The proposed baseline network modified from MobileNetV1 [15], which replaces the original (3×3 depth-wise and 1×1 point-wise) convolutional pairs with the proposed blocks. (a) The baseline's configuration in terms of channel and layer numbers is identical to that of MobileNetV1. If the input and output channel numbers are equal in a dw-pw-conv pair in the original network, a normal block is used, otherwise a reduction block is adopted. For the reduction block, we duplicate the input activation and concatenate the outputs to increase the channel number. As a result, all 1-bit convolutions have the same input and output channel numbers and are bypassed by identity shortcuts. (b) In the proposed ReActNet block, ReAct-Sign and ReAct-PReLU are added to the baseline network.

    4 Methodology

In this section, we first introduce our proposed baseline network in Section 4.1. Then we analyze how the variation in activation distribution affects the feature quality and in turn influences the final performance. Based on this analysis, we introduce ReActNet, which explicitly reshapes and shifts the activation distribution using the ReAct-PReLU and ReAct-Sign functions described in Section 4.2, and matches the outputs via a distributional loss defined between the binary and real-valued networks, detailed in Section 4.3.

    4.1 Baseline Network

Most studies on binary neural networks have been binarizing the ResNet structure. However, further compressing compact networks, such as the MobileNets, would be more logical and of greater interest for practical applications. Thus, we chose the MobileNetV1 [15] structure for constructing our baseline binary network.

Inspired by Bi-Real Net [23], we add a shortcut to bypass every 1-bit convolutional layer that has the same number of input and output channels. The 3×3 depth-wise and the 1×1 point-wise convolutional blocks in MobileNetV1 [15] are replaced by 3×3 and 1×1 vanilla convolutions in parallel with shortcuts, respectively, as shown in Fig. 2.


Moreover, we propose a new structure design to handle the downsampling layers. For the downsampling layers whose input and output feature map sizes differ, previous works [23,37,3] adopt real-valued convolutional layers to match their dimensions and to make sure the real-valued feature map propagating along the shortcut will not be "cut off" by the activation binarization. However, this strategy increases the computational cost. Instead, our proposal is to make sure that all convolutional layers have the same input and output dimensions so that we can safely binarize them and use a simple identity shortcut for activation propagation without additional real-valued matrix multiplications.

As shown in Fig. 2(a), we duplicate input channels and concatenate two blocks with the same inputs to address the channel number difference, and we also use average pooling in the shortcut to match spatial downsampling; a minimal sketch of this reduction block is given below. All layers in our baseline network are binarized, except the first input convolutional layer and the output fully-connected layer. Such a structure is hardware friendly.
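The following PyTorch-style sketch is our own illustration of the duplicate-and-concatenate reduction block under stated assumptions (module and argument names are ours, the convolution weights are left real-valued rather than binarized, and the sign is a plain forward pass without a straight-through estimator); it only shows how the avg-pool shortcut and the concatenation keep every 1-bit convolution paired with an identity shortcut.

```python
import torch
import torch.nn as nn

class SignActivation(nn.Module):
    """Forward-only sign binarization; training would use a straight-through
    estimator, and ReActNet replaces this with RSign (Section 4.2)."""
    def forward(self, x):
        return torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))

class ReductionBlockSketch(nn.Module):
    """Hypothetical reduction block: a 3x3 conv (stride 2) bypassed by a 2x2
    average-pool shortcut, then a duplicated activation feeding two 1x1 conv
    branches (each with an identity shortcut) whose outputs are concatenated,
    so channels double while every conv keeps equal input/output channels."""
    def __init__(self, channels):
        super().__init__()
        self.act = SignActivation()
        self.conv3 = nn.Conv2d(channels, channels, 3, stride=2, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.pool = nn.AvgPool2d(2, stride=2)
        self.conv1a = nn.Conv2d(channels, channels, 1, bias=False)
        self.conv1b = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1a = nn.BatchNorm2d(channels)
        self.bn1b = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Spatial downsampling: 3x3 conv branch plus a parameter-free avg-pool shortcut.
        y = self.bn3(self.conv3(self.act(x))) + self.pool(x)
        # Channel expansion: duplicate the activation, run two 1x1 conv branches
        # with identity shortcuts, and concatenate along the channel dimension.
        a = self.bn1a(self.conv1a(self.act(y))) + y
        b = self.bn1b(self.conv1b(self.act(y))) + y
        return torch.cat([a, b], dim=1)   # resolution halved, channels doubled

# Example: a 32-channel, 56x56 input becomes 64 channels at 28x28.
out = ReductionBlockSketch(32)(torch.randn(1, 32, 56, 56))
assert out.shape == (1, 64, 28, 28)
```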

    4.2 ReActNet

The intrinsic property of an image classification neural network is to learn a mapping from input images to the output logits. A logical deduction is that a well-performing binary neural network should learn a similar logits distribution to a real-valued network. However, the discrete values of variables limit binary neural networks from learning as rich distributional representations as real-valued ones. To address this, XNOR-Net [30] proposed to calculate analytical real-valued scaling factors and multiply them with the activations. Its follow-up works [38,4] further proposed to learn these factors through back-propagation.

In contrast to these previous works, this paper focuses on a different aspect: the activation distribution. We observed that small variations to activation distributions can greatly affect the semantic feature representations in 1-bit CNNs, which in turn will influence the final performance. However, 1-bit CNNs have limited capacity to learn appropriate activation distributions. To address this dilemma, we introduce generalized activation functions with learnable coefficients to increase the flexibility of 1-bit CNNs for learning semantically-optimized distributions.

Distribution Matters in 1-bit CNNs. The importance of distribution has not been investigated much in training a real-valued network, because with weights and activations being continuous real values, reshaping or moving distributions would be effortless.

However, for 1-bit CNNs, learning distribution is both crucial and difficult. Because the activations in a binary convolution can only take values from {−1,+1}, a small distributional shift in the input real-valued feature map before the sign function can result in completely different output binary activations, which will directly affect the informativeness of the feature and significantly impact the final accuracy. For illustration, we plot the output binary feature maps of real-valued inputs with the original (Fig. 3(b)), positively-shifted (Fig. 3(a)), and negatively-shifted (Fig. 3(c)) activation distributions. Real-valued feature maps are robust to such shifts: the legibility of


(Fig. 3 panels: real-valued histograms with the corresponding real and binary feature maps for (a) negatively shifted input, (b) original input, and (c) positively shifted input.)

Fig. 3. An illustration of how distribution shift affects feature learning in binary neural networks. An ill-shifted distribution will introduce (a) too much background noise or (c) too few useful features, which harms feature learning.

semantic information is largely maintained, while binary feature maps are sensitive to these shifts, as illustrated in Fig. 3(a) and Fig. 3(c).

Explicit Distribution Reshape and Shift via Generalized Activation Functions. Based on the aforementioned observation, we propose a simple yet effective operation to explicitly reshape and shift the activation distributions, dubbed ReAct, which generalizes the traditional Sign and PReLU functions to ReAct-Sign (abbreviated as RSign) and ReAct-PReLU (abbreviated as RPReLU) respectively.

Definition.

Essentially, RSign is defined as a sign function with channel-wise learnable thresholds:

x_i^b = h(x_i^r) = \begin{cases} +1, & \text{if } x_i^r > \alpha_i \\ -1, & \text{if } x_i^r \le \alpha_i \end{cases} \quad (3)

Here, x_i^r is the real-valued input of the RSign function h on the i-th channel, x_i^b is the binary output, and α_i is a learnable coefficient controlling the threshold. The subscript i in α_i indicates that the threshold can vary for different channels. The superscripts b and r refer to binary and real values. Fig. 4(a) shows the shapes of RSign and Sign.

    Similarly, RPReLU is defined as

f(x_i) = \begin{cases} x_i - \gamma_i + \zeta_i, & \text{if } x_i > \gamma_i \\ \beta_i (x_i - \gamma_i) + \zeta_i, & \text{if } x_i \le \gamma_i \end{cases} \quad (4)

where x_i is the input of the RPReLU function f on the i-th channel, γ_i and ζ_i are learnable shifts for moving the distribution, and β_i is a learnable coefficient controlling the slope of the negative part. All the coefficients are allowed to be different across channels. Fig. 4(b) compares the shapes of RPReLU and PReLU.
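As a concrete reading of Eqs. (3) and (4), here is a minimal PyTorch-style sketch of the two generalized activations. It is our own illustration rather than the authors' released implementation: the forward pass of RSign omits the straight-through estimator needed to train through the sign, and the β initialization of 0.25 is an assumption borrowed from the standard PReLU default.

```python
import torch
import torch.nn as nn

class RSign(nn.Module):
    """ReAct-Sign (Eq. 3): sign with a learnable per-channel threshold alpha."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        return torch.where(x > self.alpha, torch.ones_like(x), -torch.ones_like(x))

class RPReLU(nn.Module):
    """ReAct-PReLU (Eq. 4): shift by gamma, scale the negative part by beta,
    then shift the output by zeta, all channel-wise and learnable."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.zeta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(0.25 * torch.ones(1, channels, 1, 1))  # assumed init

    def forward(self, x):
        shifted = x - self.gamma
        return torch.where(shifted > 0, shifted, self.beta * shifted) + self.zeta

# Example: channel-wise application on a 64-channel feature map.
feat = torch.randn(2, 64, 14, 14)
binary_feat = RSign(64)(feat)   # entries in {-1, +1}, threshold learned per channel
reacted = RPReLU(64)(feat)
```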

Intrinsically, RSign is learning the best channel-wise threshold (α) for binarizing the input feature map or, equivalently, shifting the input distribution to


(Fig. 4 panels: (a) Sign vs. RSign and (b) PReLU vs. RPReLU, plotting each traditional function against its generalized counterpart with the learnable coefficients α, β, γ and ζ.)

Fig. 4. Proposed activation functions, RSign and RPReLU, with learnable coefficients, and the traditional activation functions, Sign and PReLU.


Fig. 5. An explanation of how the proposed RPReLU operates. It first moves the input distribution by −γ, then reshapes the negative part by multiplying it with β, and lastly moves the output distribution by ζ.

obtain the best distribution for taking a sign. From the latter angle, RPReLU can be easily interpreted as: γ shifts the input distribution, finding the best point to use β to "fold" the distribution, then ζ shifts the output distribution, as illustrated in Fig. 5. These learned coefficients automatically adjust activation distributions for obtaining good binary features, which enhances the 1-bit CNNs' performance. With the introduction of these functions, the aforementioned difficulty in distributional learning can be greatly alleviated, and the 1-bit convolutions can effectively focus on learning more meaningful patterns. We will show later in the results section that this enhancement boosts the baseline network's top-1 accuracy substantially.

The number of extra parameters introduced by RSign and RPReLU is only 4 × the number of channels in the network, which is negligible considering the large size of the weight matrices. The computational overhead approximates that of a typical non-linear layer, which is also trivial compared to the computationally intensive convolutional operations.
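For a rough, hypothetical sense of scale (the layer width below is an illustrative choice, not a number from the paper): a layer with C = 512 channels adds only

4 \times C = 4 \times 512 = 2048

extra scalars (one α for RSign plus β, γ, ζ for RPReLU per channel), compared with 512 × 512 × 3 × 3 ≈ 2.36M weights of a dense 3×3 convolution at the same width.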

    Optimization

Parameters in RSign and RPReLU can be optimized end-to-end with the other parameters in the network. The gradient of α_i in RSign can be simply derived by the chain rule as:

\frac{\partial \mathcal{L}}{\partial \alpha_i} = \sum_{x_i^r} \frac{\partial \mathcal{L}}{\partial h(x_i^r)} \, \frac{\partial h(x_i^r)}{\partial \alpha_i}, \quad (5)


where \mathcal{L} represents the loss function and \partial \mathcal{L} / \partial h(x_i^r) denotes the gradients from deeper layers. The summation \sum_{x_i^r} is applied to all entries in the i-th channel. The derivative \partial h(x_i^r) / \partial \alpha_i can be easily computed as

\frac{\partial h(x_i^r)}{\partial \alpha_i} = -1 \quad (6)

Similarly, for each parameter in RPReLU, the gradients are computed with the following formulas:

\frac{\partial f(x_i)}{\partial \beta_i} = \mathbf{I}_{\{x_i \le \gamma_i\}} \cdot (x_i - \gamma_i), \quad (7)

\frac{\partial f(x_i)}{\partial \gamma_i} = -\mathbf{I}_{\{x_i \le \gamma_i\}} \cdot \beta_i - \mathbf{I}_{\{x_i > \gamma_i\}}, \quad (8)

\frac{\partial f(x_i)}{\partial \zeta_i} = 1. \quad (9)

Here, \mathbf{I} denotes the indicator function: \mathbf{I}_{\{\cdot\}} = 1 when the inequality inside {·} holds, and \mathbf{I}_{\{\cdot\}} = 0 otherwise.
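As a sanity check of Eqs. (7)-(9), the short sketch below (our own illustration, with arbitrary example values) implements the RPReLU of Eq. (4) for a single channel and verifies that autograd reproduces the three analytic gradients.

```python
import torch

beta = torch.tensor(0.3, requires_grad=True)
gamma = torch.tensor(0.1, requires_grad=True)
zeta = torch.tensor(-0.2, requires_grad=True)
x = torch.tensor([-1.0, 0.05, 0.7])          # mixes the x <= gamma and x > gamma cases

# RPReLU of Eq. (4), summed over entries so each gradient is a sum of Eqs. (7)-(9).
shifted = x - gamma
f = torch.where(shifted > 0, shifted, beta * shifted) + zeta
f.sum().backward()

ind = (x <= gamma).float()                    # indicator I{x_i <= gamma_i}
assert torch.allclose(beta.grad, (ind * (x - gamma.detach())).sum())         # Eq. (7)
assert torch.allclose(gamma.grad, (-ind * beta.detach() - (1 - ind)).sum())  # Eq. (8)
assert torch.allclose(zeta.grad, torch.tensor(float(x.numel())))             # Eq. (9)
```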

    4.3 Distributional Loss

Based on the insight that the performance can be enhanced if the binary neural network learns distributions similar to those of a real-valued network, we use a distributional loss to enforce this similarity, formulated as:

\mathcal{L}_{\mathrm{Distribution}} = -\frac{1}{n} \sum_{c} \sum_{i=1}^{n} p_c^{R_\theta}(X_i) \log\!\left(\frac{p_c^{B_\theta}(X_i)}{p_c^{R_\theta}(X_i)}\right), \quad (10)

where the distributional loss \mathcal{L}_{\mathrm{Distribution}} is defined as the KL divergence between the softmax outputs p_c of a real-valued network R_θ and a binary network B_θ. The subscript c denotes classes and n is the batch size.
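A minimal sketch of Eq. (10) in PyTorch terms, assuming the real-valued teacher's logits are available and treated as a fixed target (function and variable names are ours):

```python
import torch
import torch.nn.functional as F

def distributional_loss(binary_logits, real_logits):
    """Eq. (10): batch-averaged KL divergence KL(p^R || p^B) between the softmax
    output of the real-valued network (target) and the binary network."""
    p_real = F.softmax(real_logits.detach(), dim=1)    # teacher probabilities p^R
    log_p_bin = F.log_softmax(binary_logits, dim=1)    # student log-probabilities log p^B
    # kl_div(input=log q, target=p) computes sum_c p * (log p - log q) = KL(p || q);
    # 'batchmean' divides by the batch size n, matching the 1/n factor in Eq. (10).
    return F.kl_div(log_p_bin, p_real, reduction="batchmean")
```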

Different from prior work [46], which needs to match the outputs of every intermediate block, or [3], which further uses a multi-step progressive structural transition, we found that our distributional loss, while much simpler, can yield competitive results. Moreover, without block-wise constraints, our approach enjoys flexibility in choosing the real-valued network, without requiring architectural similarity between the real and binary networks.

    5 Experiments

To investigate the performance of the proposed methods, we conduct experiments on the ImageNet dataset. We first introduce the dataset and training strategy in Section 5.1, followed by a comparison between the proposed networks and the state-of-the-art in terms of both accuracy and computational cost in Section 5.2. We


then analyze the effects of the distributional loss, the concatenated downsampling layer, and the RSign and RPReLU functions in detail in the ablation study described in Section 5.3. Visualization results on how RSign and RPReLU help the binary network capture the fine-grained underlying distribution are presented in Section 5.4.

    5.1 Experimental Settings

Dataset. The experiments are carried out on the ILSVRC12 ImageNet classification dataset [31], which is more challenging than small datasets such as CIFAR [16] and MNIST [27]. In our experiments, we use the classic data augmentation method described in [15].

Training Strategy. We followed the standard binarization method in [23] and adopted the two-step training strategy of [3]. In the first step, we train a network with binary activations and real-valued weights from scratch. In the second step, we inherit the weights from the first step as the initial values and fine-tune the network with both weights and activations being binary. For both steps, the Adam optimizer with a linear learning rate decay scheduler is used, and the initial learning rate is set to 5e-4. We train for 600k iterations with a batch size of 256. The weight decay is set to 1e-5 for the first step and 0 for the second step.

Distributional Loss. In both steps, we use the proposed distributional loss as the objective function for optimization, replacing the original cross-entropy loss between the binary network output and the label.

OPs Calculation. Following the calculation method in [3], we count the binary operations (BOPs) and floating-point operations (FLOPs) separately. The total number of operations (OPs) is calculated as OPs = BOPs/64 + FLOPs, following [30,23].
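A quick worked instance of the OPs accounting (the BOPs and FLOPs values are the ReActNet-A numbers reported in Table 1):

```python
def total_ops(bops, flops):
    """OPs = BOPs / 64 + FLOPs, following [30,23,3]."""
    return bops / 64 + flops

# ReActNet-A: 4.82e9 BOPs and 0.12e8 FLOPs give roughly 0.87e8 OPs (87M OPs).
print(total_ops(4.82e9, 0.12e8))   # ~8.7e7
```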

    5.2 Comparison with State-of-the-art

We compare ReActNet with state-of-the-art quantization and binarization methods. Table 1 shows that ReActNet-A already outperforms all the quantization methods in the left part, and also achieves 4.0% higher accuracy than the state-of-the-art Real-to-Binary Network [3] with only approximately half of the OPs. Moreover, in contrast to [3], which computes channel re-scaling for each block with real-valued fully-connected layers, ReActNet-A has pure 1-bit convolutions except for the first and the last layers, which is more hardware-friendly.

To make a further comparison with previous approaches that use real-valued convolution to enhance the binary network's accuracy [23,3,2], we constructed ReActNet-B and ReActNet-C, which replace the 1-bit 1×1 convolutions with real-valued 1×1 convolutions in the downsampling layers, as shown in Fig. 6(c). ReActNet-B defines the real-valued convolutions to be group convolutions with 4 groups, while ReActNet-C uses full real-valued convolution. We show that ReActNet-B achieves 13.7% higher accuracy than Bi-RealNet-18 with the same number of OPs, and ReActNet-C outperforms MeliusNet59 by 0.7% with less than half of the OPs.


Left part (quantization methods on ResNet-18):
Methods               Bitwidth (W/A)  Top-1 Acc (%)
BWN [7]               1/32            60.8
TWN [18]              2/32            61.8
INQ [42]              2/32            66.0
TTQ [44]              2/32            66.6
SYQ [10]              1/2             55.4
HWGQ [5]              1/2             59.6
LQ-Nets [39]          1/2             62.6
DoReFa-Net [43]       1/4             59.2
Ensemble BNN [45]     (1/1) × 6       61.1
Circulant CNN [20]    (1/1) × 4       61.4
Structured BNN [47]   (1/1) × 4       64.2
Structured BNN* [47]  (1/1) × 4       66.3
ABC-Net [19]          (1/1) × 5       65.0

Right part (binary methods):
Binary Methods          BOPs (×10^9)  FLOPs (×10^8)  OPs (×10^8)  Top-1 Acc (%)
BNNs [7]                1.70          1.20           1.47         42.2
CI-BCNN [37]            –             –              1.63         59.9
Binary MobileNet [28]   –             –              1.54         60.9
PCNN [11]               –             –              1.63         57.3
XNOR-Net [30]           1.70          1.41           1.67         51.2
Trained Bin [38]        –             –              –            54.2
Bi-RealNet-18 [23]      1.68          1.39           1.63         56.4
Bi-RealNet-34 [23]      3.53          1.39           1.93         62.2
Bi-RealNet-152 [21]     10.7          4.48           6.15         64.5
Real-to-Binary Net [3]  1.68          1.56           1.83         65.4
MeliusNet29 [2]         5.47          1.29           2.14         65.8
MeliusNet42 [2]         9.69          1.74           3.25         69.2
MeliusNet59 [2]         18.3          2.45           5.32         70.7
Our ReActNet-A (1/1)    4.82          0.12           0.87         69.4
Our ReActNet-B (1/1)    4.69          0.44           1.63         70.1
Our ReActNet-C (1/1)    4.69          1.40           2.14         71.4

Table 1. Comparison of the top-1 accuracy with state-of-the-art methods. The left part presents quantization methods applied to the ResNet-18 structure and the right part presents binarization methods with varied structures (ResNet-18 if not specified). Quantization methods include weight quantization (upper left block), low-bit weight and activation quantization (middle left block), and weight and activation binarization with expanded network capacity (lower left block), where the number multiplying (1/1) indicates the multiplicative factor. (W/A) represents the number of bits used in weight or activation quantization.

Moreover, we applied the ReAct operations to Bi-RealNet-18 and obtained 65.5% top-1 accuracy, increasing the accuracy of Bi-RealNet-18 by 9.1% without changing the network structure.

Considering the challenges in previous attempts to enhance 1-bit CNNs' performance, the accuracy leap achieved by ReActNets is significant. It requires an ingenious use of binary networks' special properties to effectively utilize every precious bit and strike a delicate balance between binary and real-valued information. For example, ReActNet-A, with 69.4% top-1 accuracy at 87M OPs, outperforms the real-valued 0.5× MobileNetV1 by 5.7% in accuracy with 41.6% fewer OPs. These results demonstrate the potential of 1-bit CNNs and the effectiveness of our ReActNet design.

    5.3 Ablation Study

We conduct ablation studies to analyze the individual effect of the following proposed techniques:

Block Duplication and Concatenation. Real-valued shortcuts are crucial for binary neural network accuracy [23]. However, the input and output channel numbers of the downsampling layers differ by a factor of two, which violates the requirement for adding the shortcuts, namely an equal number of input and output channels. In the proposed baseline network, we duplicate the downsampling blocks and concatenate the outputs (Fig. 6(a)), enabling the use of identity shortcuts


Network                              Top-1 Acc (%)
Baseline network † *                 58.2
Baseline network †                   59.6
Proposed baseline network *          61.1
Proposed baseline network            62.5
Proposed baseline network + PReLU    65.5
Proposed baseline network + RSign    66.1
Proposed baseline network + RPReLU   67.4
ReActNet-A (RSign and RPReLU)        69.4
Corresponding real-valued network    72.4

Table 2. The effects of different components in ReActNet on the final accuracy. († denotes the network not using the concatenated blocks, but directly binarizing the downsampling layers instead. * indicates not using the proposed distributional loss during training.)

(Fig. 6 diagrams: the three downsampling block variants, assembled from Sign, 1-bit 1×1 Conv, real-valued 1×1 Conv (group = g), BatchNorm, PReLU, duplicate-activation and concatenate operations; the variants are captioned in (a)-(c) below.)

    (a) Concatenated 1-bit convolutional block in Proposed Baseline Network

    (b) Normal 1-bit convolutional block for downsampling in baseline network

(c) Real-valued convolutional block for downsampling in ReActNet-B and C

    Fig. 6. Variations in the downsampling layer design.

to bypass 1-bit convolutions in the downsampling layers. This idea alone results in a 2.9% accuracy enhancement compared to the network without concatenation (Fig. 6(b)). The enhancement can be observed by comparing the 2nd and 4th rows of Table 2. With the proposed downsampling layer design, our baseline network achieves both high accuracy and a high compression ratio. Because it no longer requires real-valued matrix multiplications in the downsampling layers, the computational cost is greatly reduced. As a result, even without using the distributional loss in training, our proposed baseline network has already surpassed the Strong Baseline in [3] by 0.1% top-1 accuracy at only half of the OPs. With this strong performance, this simple baseline network serves well as a new baseline for future studies on compact binary neural networks.

Distributional Loss. The results in the first section of Table 2 also validate that the distributional loss, designed for matching the output distributions of binary and real-valued neural networks, is effective for enhancing the performance of the proposed baseline network, improving the accuracy by 1.4%; this gain is achieved independently of the network architecture design.

ReAct Operations. The introduction of RSign and RPReLU improves the accuracy by 4.9% and 3.6% respectively over the proposed baseline network, as shown in the second section of Table 2. By adding both RSign and RPReLU,


Fig. 7. Comparison of validation accuracy curves between the baseline networks and ReActNet. Using the proposed RSign and RPReLU (red curve) achieves higher accuracy and is more robust than using Sign and PReLU (green curve).

ReActNet-A achieves 6.9% higher accuracy than the baseline, narrowing the accuracy gap to the corresponding real-valued network to within 3.0%. Compared to merely using Sign and PReLU, the use of the generalized activation functions, RSign and RPReLU, with simple learnable parameters boosts the accuracy by 3.9%, which is very significant for the ImageNet classification task. As shown in Fig. 7, the validation curve of the network using the original Sign + PReLU oscillates vigorously, which we suspect is triggered by the slope coefficient β in PReLU changing its sign, which in turn affects the later layers with an avalanche effect. This also indirectly confirms our assumption that 1-bit CNNs are vulnerable to distributional changes. In comparison, the proposed RSign and RPReLU functions are effective for stabilizing training in addition to improving the accuracy.

    5.4 Visualization

To help gain better insights, we visualize the learned coefficients as well as the intermediate activation distributions.

Learned Coefficients. For clarity, we present the learned coefficients of each layer in the form of color bars in Fig. 8. Compared to the binary network using traditional PReLU, whose learned slopes β are positive only (Fig. 8(a)), ReActNet using RPReLU learns both positive and negative slopes (Fig. 8(c)), which are closer to the distributions of PReLU coefficients in a real-valued network we trained (Fig. 8(b)). The learned distribution-shifting coefficients also have large absolute values, as shown in Rows 1-3 of Fig. 8(c), indicating the necessity of their explicit shift for high-performance 1-bit CNNs.

Activation Distribution. In Fig. 9, we show the histograms of activation distributions inside the trained baseline network and ReActNet. Compared to the baseline network without RSign and RPReLU, ReActNet's distributions are more enriched and subtle, as shown in the fourth sub-figure of Fig. 9(b). Also, in ReActNet, the distribution of -1 and +1 after the sign function is more balanced, as illustrated in the second sub-figure of Fig. 9(b), suggesting better utilization of black and white pixels in representing the binary features.


(Fig. 8 panels: learned-coefficient color bars over the target layers for (a) Baseline Network + PReLU, (b) the corresponding real-valued network with PReLU, and (c) ReActNet-A, on a color scale from -1.0 to 1.0, with extreme values such as 6.56, 6.15, 1.99, 3e-6, -0.61 and -0.62 marked.)

Fig. 8. The color bars of the learned coefficients. Blue denotes positive values while red denotes negative values, and the darkness of the color reflects the absolute value. We also mark coefficients that have extreme values.

(Fig. 9 panels: activation histograms taken between the layers of (a) the proposed baseline network, Sign → 1-bit 3×3 Conv → BatchNorm, and (b) ReActNet, RSign → 1-bit 3×3 Conv → BatchNorm → RPReLU.)

Fig. 9. Histograms of the activation distributions.

    6 Conclusions

In this paper, we present several new ideas to optimize a 1-bit CNN for higher accuracy. We first design parameter-free shortcuts based on MobileNetV1 to propagate real-valued feature maps in both the normal convolutional layers and the downsampling layers. This yields a baseline binary network with 61.1% top-1 accuracy at only 87M OPs on the ImageNet dataset. Then, based on our observation that 1-bit CNNs' performance is highly sensitive to distributional variations, we propose ReAct-Sign and ReAct-PReLU to shift and reshape the distributions in a learnable fashion, and demonstrate their dramatic enhancements to the top-1 accuracy. We also propose to incorporate a distributional loss, defined between the outputs of the binary network and the real-valued reference network, to replace the original cross-entropy loss for training. With the contributions jointly achieved by these ideas, the proposed ReActNet achieves 69.4% top-1 accuracy on ImageNet, which is just 3% shy of its real-valued counterpart, at a substantially lower computational cost.

    Acknowledgement

The authors would like to acknowledge HKSAR RGC's funding support under grant GRF-16207917.


    References

1. Alizadeh, M., Fernández-Marqués, J., Lane, N.D., Gal, Y.: An empirical study of binary neural networks' optimisation (2018) 4
2. Bethge, J., Bartz, C., Yang, H., Chen, Y., Meinel, C.: Meliusnet: Can binary neural networks achieve mobilenet-level accuracy? arXiv preprint arXiv:2001.05936 (2020) 2, 4, 10, 11
3. Martinez, B., Yang, J., Bulat, A., Tzimiropoulos, G.: Training binary neural networks with real-to-binary convolutions. International Conference on Learning Representations (2020) 2, 3, 4, 6, 9, 10, 11, 12
4. Bulat, A., Tzimiropoulos, G.: Xnor-net++: Improved binary neural networks. British Machine Vision Conference (2019) 4, 6
5. Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave gaussian quantization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5918–5926 (2017) 11
6. Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object detection models with knowledge distillation. In: Advances in Neural Information Processing Systems. pp. 742–751 (2017) 3
7. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016) 1, 3, 11
8. Ding, R., Chin, T.W., Liu, Z., Marculescu, D.: Regularizing activation distribution for training binarized deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11408–11417 (2019) 1, 2, 4
9. Ding, X., Zhou, X., Guo, Y., Han, J., Liu, J., et al.: Global sparse momentum sgd for pruning very deep neural networks. In: Advances in Neural Information Processing Systems. pp. 6379–6391 (2019) 3
10. Faraone, J., Fraser, N., Blott, M., Leong, P.H.: Syq: Learning symmetric quantization for efficient deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4300–4309 (2018) 11
11. Gu, J., Li, C., Zhang, B., Han, J., Cao, X., Liu, J., Doermann, D.: Projection convolutional neural networks for 1-bit cnns via discrete back propagation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 8344–8351 (2019) 4, 11
12. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1389–1397 (2017) 3
13. Helwegen, K., Widdicombe, J., Geiger, L., Liu, Z., Cheng, K.T., Nusselder, R.: Latent weights do not exist: Rethinking binarized neural network optimization. In: Advances in Neural Information Processing Systems. pp. 7531–7542 (2019) 4
14. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015) 3
15. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017) 2, 3, 5, 10
16. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009) 3, 10
17. Li, B., Shan, Y., Hu, M., Wang, Y., Chen, Y., Yang, H.: Memristor-based approximated computation. In: Proceedings of the 2013 International Symposium on Low Power Electronics and Design. pp. 242–247. IEEE Press (2013) 1


18. Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016) 11
19. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: Advances in Neural Information Processing Systems. pp. 345–353 (2017) 4, 11
20. Liu, C., Ding, W., Xia, X., Zhang, B., Gu, J., Liu, J., Ji, R., Doermann, D.: Circulant binary convolutional networks: Enhancing the performance of 1-bit dcnns with circulant back propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2691–2699 (2019) 11
21. Liu, Z., Luo, W., Wu, B., Yang, X., Liu, W., Cheng, K.T.: Bi-real net: Binarizing deep network towards real-network performance. International Journal of Computer Vision pp. 1–18 (2018) 4, 11
22. Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K.T., Sun, J.: Metapruning: Meta learning for automatic neural network channel pruning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3296–3305 (2019) 3
23. Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., Cheng, K.T.: Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 722–737 (2018) 2, 4, 5, 6, 10, 11
24. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2736–2744 (2017) 3
25. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018) 3
26. Mishra, A., Nurvitadhi, E., Cook, J.J., Marr, D.: Wrpn: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134 (2017) 4
27. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning. vol. 2011, p. 5 (2011) 3, 10
28. Phan, H., Liu, Z., Huynh, D., Savvides, M., Cheng, K.T., Shen, Z.: Binarizing mobilenet via evolution-based searching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13420–13429 (2020) 4, 11
29. Qin, H., Gong, R., Liu, X., Shen, M., Wei, Z., Yu, F., Song, J.: Forward and backward information retention for accurate binary neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2250–2259 (2020) 4
30. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: European Conference on Computer Vision. pp. 525–542. Springer (2016) 1, 2, 3, 4, 6, 10, 11
31. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015) 3, 10
32. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018) 3
33. Shen, Z., He, Z., Xue, X.: Meal: Multi-model ensemble via adversarial learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 4886–4893 (2019) 3


34. Soudry, D., Hubara, I., Meir, R.: Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In: Advances in Neural Information Processing Systems. pp. 963–971 (2014) 3
35. Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE 105(12), 2295–2329 (2017) 3
36. Tang, W., Hua, G., Wang, L.: How to train a compact binary neural network with high accuracy? In: Thirty-First AAAI Conference on Artificial Intelligence (2017) 4
37. Wang, Z., Lu, J., Tao, C., Zhou, J., Tian, Q.: Learning channel-wise interactions for binary convolutional neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019) 2, 4, 6, 11
38. Xu, Z., Cheung, R.C.: Accurate and compact convolutional neural networks with trained binarization. British Machine Vision Conference (2019) 6, 11
39. Zhang, D., Yang, J., Ye, D., Hua, G.: Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 365–382 (2018) 3, 11
40. Zhang, J., Pan, Y., Yao, T., Zhao, H., Mei, T.: dabnn: A super fast inference framework for binary neural networks on arm devices. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 2272–2275 (2019) 1
41. Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6848–6856 (2018) 3
42. Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044 (2017) 11
43. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016) 3, 11
44. Zhu, C., Han, S., Mao, H., Dally, W.J.: Trained ternary quantization. arXiv preprint arXiv:1612.01064 (2016) 11
45. Zhu, S., Dong, X., Su, H.: Binary ensemble neural network: More bits per network or more networks per bit? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4923–4932 (2019) 11
46. Zhuang, B., Shen, C., Tan, M., Liu, L., Reid, I.: Towards effective low-bitwidth convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7920–7928 (2018) 3, 4, 9
47. Zhuang, B., Shen, C., Tan, M., Liu, L., Reid, I.: Structured binary neural networks for accurate image classification and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 413–422 (2019) 11