
Learning Semantically Enhanced Feature for Fine-Grained Image Classification

Wei Luo*, Member, IEEE, Hengmin Zhang*, Jun Li, and Xiu-Shen Wei

Abstract—We aim to provide a computationally cheap yet effective approach for fine-grained image classification (FGIC) in this letter. Unlike previous methods that rely on complex part localization modules, our approach learns fine-grained features by improving the semantics of sub-features of a global feature. Specifically, we first achieve sub-feature semantics by arranging the feature channels of a CNN into different groups through channel permutation. Meanwhile, to enhance the discriminability of sub-features, the groups are guided to be activated on object parts with strong discriminability by a weighted combination regularization. This process adds only 1.7% additional parameters to the ResNet-50 backbone. Moreover, our approach can be easily integrated into the backbone model as a plug-and-play module for end-to-end training with only image-level supervision. Experiments verify the effectiveness of our approach and validate its comparable performance to state-of-the-art methods. Code is available at https://github.com/cswluo/SEF

Index Terms—Image classification, visual categorization, feature learning

I. INTRODUCTION

FINE-grained image classification (FGIC) concerns the task of distinguishing subordinate categories of some base classes, such as dogs [1], birds [2], cars [3], and aircraft [4]. Due to the large intra-class pose variation and high inter-class appearance similarity, as well as the scarcity of annotated data, it is challenging to solve the FGIC problem efficiently.

Recent studies have shown great interest in tackling the FGIC problem by unifying part localization and feature learning in an end-to-end CNN [5], [6], [7], [8], [9], [10], [11]. For example, [12], [13], [5] first crop object parts from input images and feed them into models for feature extraction, where the part locations are obtained by taking as input the image-level features extracted by a convolutional network. To better exploit the relationship between feature channels, [8], [14] produce part features by weighting feature channels with soft attention. Another line of research divides feature channels into several groups, each corresponding to a semantic part of the input image [6], [9]. However, these methods usually make model optimization more difficult, since they either rely on complex part localization modules, introduce a large number of parameters into their backbone models, or require a separate module to guide the learning of the feature channel grouping.

Submitted date: 07/5/2020. This work was supported in part by NSFC under grant 61702197, and in part by NSFGD under grants 2020A151501813 and 2017A030310261.

Wei Luo is with the South China Agricultural University, Guangzhou, 510000, China (email: [email protected]).

Hengmin Zhang is with the East China University of Science and Technology, Shanghai, 200237, China (email: [email protected]).

Jun Li and Xiu-Shen Wei are with the Nanjing University of Science and Technology, Nanjing, 210094, China (email: {junli,weixs}@njust.edu.cn).

* indicates equal contribution. Wei Luo is the corresponding author.

In this letter, we propose a computationally cheap yet effective approach that learns fine-grained features by improving the semantics and discriminability of sub-features. It includes two core components: 1) a semantic grouping module that arranges feature channels with similar properties into the same group, with each group corresponding to a semantic part of the input image; and 2) a feature enhancement module that improves the discriminability of the grouped features by guiding them to be activated on object parts with strong discriminability. The two components can be easily implemented as a plug-and-play module in modern CNNs without guided initialization or modification of the backbone network structure. By coupling the two components, our approach generates powerful and distinguishable fine-grained features without requiring complex part localization modules.

Concretely, we construct semantic groups in the last convolutional layer of a CNN by arranging feature channels through a permutation matrix, which is learned implicitly by regularizing the relationships between feature channels, i.e., maximizing the correlations between feature channels in the same predefined group and decorrelating those in different predefined groups. Our feature channel grouping thus introduces no additional parameters into the backbone model. Compared to [9], [6], our strategy groups feature channels more consistently and requires no guided initialization. To guide the semantic groups to be activated on object parts with strong discriminability, we introduce a regularization method that employs a weighted combination of maximum entropy learning and knowledge distillation. The weighted combination is derived from matching the prediction distributions of the global feature and its group-wise sub-features. This regularization introduces only a small number of parameters into the ResNet-50 backbone, needed to output the prediction distributions of sub-features. Overall, by coupling the two components, our approach efficiently obtains fine-grained features with strong discriminability. In addition, it can be easily integrated into the backbone model as a plug-and-play module for end-to-end training with only image-level supervision. Our contributions are summarized as follows:

• We propose a computationally cheap FGIC approach that achieves comparable performance to the state-of-the-art methods with only 1.7% more parameters than its ResNet-50 backbone.

• We propose to achieve part localization by learning semantic groups of feature channels, which requires neither guided initialization nor extra parameters.

• We propose to enhance feature discriminability by guiding its sub-features to be extracted from object parts with strong discriminability.


Fig. 1. Overview of our approach. The last-layer convolutional feature channels (depicted by the mixed color block) of the CNN are arranged into different groups (represented by different colors) by our semantic groups learning module. The global and group-wise features are obtained from the arranged feature channels by average pooling. The light yellow block in the gray block denotes the predicted class distribution from the corresponding group-wise feature, which is regularized by the output of the global feature through knowledge distillation. All gray blocks are effective only in the training stage and are removed in the testing stage. The details of the CNN are omitted for clarity. Best viewed in color.


II. PROPOSED APPROACH

The proposed approach involves two main components: 1) a semantic grouping module that arranges feature channels with similar properties into the same group so that each group has semantic meaning; and 2) a feature enhancement module that elevates feature performance by improving the discriminability of its sub-features. Fig. 1 gives an overview of our approach.

A. Semantic Groups Learning

It has been investigated in previous work [15] that bunches of filters in high-level layers of CNNs are required to represent a semantic concept. Thus we develop a regularization method that arranges filters with similar properties into the same group to capture a semantic concept. Specifically, given a convolutional feature $X^L \in \mathbb{R}^{C \times WH}$ with channels $X^L_i \in \mathbb{R}^{WH}$, $i \in [1, \cdots, C]$ in layer $L$, we first arrange its feature channels through a permutation operation $X^{L'} = AX^L$, where $A \in \mathbb{R}^{C \times C}$ is a permutation matrix, and then divide the channels into $G$ groups with the expectation that each group corresponds to a semantic concept. Since $X^L$ is obtained by convolving the filters of layer $L$ with the features of layer $L-1$, the convolution operation can be implemented as a matrix product

$$X^L = BX^{L-1}, \qquad (1)$$

where $B \in \mathbb{R}^{C \times \Omega}$ and $X^{L-1}$ are, respectively, the reshaped filters of layer $L$ (each row a filter) and the reshaped feature of layer $L-1$. Thus $X^{L'}$ can be rewritten as

$$X^{L'} = AX^L = ABX^{L-1} = WX^{L-1}, \qquad (2)$$

where $W = AB$ consists of the same filters as $B$, but with the rows exchanged by the permutation matrix $A$.
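To make Eq. 2 concrete, the following toy sketch (hypothetical sizes, plain NumPy; not the training code) checks that multiplying by a permutation matrix only reorders the filter rows of $B$, leaving the filters themselves unchanged:

```python
import numpy as np

C, Omega = 4, 6                 # hypothetical channel / filter sizes
B = np.random.randn(C, Omega)   # reshaped filters of layer L, one per row
perm = [2, 0, 3, 1]             # an arbitrary target channel order
A = np.eye(C)[perm]             # permutation matrix built from identity rows
W = A @ B                       # Eq. 2: W = AB
assert np.allclose(W, B[perm])  # same filters as B, rows exchanged
```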

To achieve semantic groups, $A$ should be learned to discover the similarities between the filters of $B$. It is, however, nontrivial to learn a permutation matrix directly. Therefore, we bypass learning $A$ by instead constraining the relationships between the feature channels of $X^{L'}$, thereby learning $W$ directly. To this end, we maximize the correlation between feature channels in the same group while decorrelating those in different groups. Concretely, let $X^{L'}_i \leftarrow X^{L'}_i / \|X^{L'}_i\|_2$ be a normalized channel. The correlation between channels is then defined as

$$d_{ij} = X^{L'}_i X^{L'T}_j, \qquad (3)$$

where $T$ denotes transposition. Let $D \in \mathbb{R}^{G \times G}$ be the correlation matrix with element $D_{mn} = \frac{1}{C_m C_n}\sum_{i \in m, j \in n} d_{ij}$ corresponding to the average correlation of feature channels from groups $m$ and $n$, where $m, n \in \{1, \cdots, G\}$ and $C_m$ is the number of channels in group $m$. Then the semantic groups can be achieved by minimizing

$$\mathcal{L}_{group} = \frac{1}{2}\left(\|D\|_F^2 - 2\|\mathrm{diag}(D)\|_2^2\right), \qquad (4)$$

where $\mathrm{diag}(\cdot)$ extracts the main diagonal of a matrix.

In CNNs, Eq. 2 can be implemented as a convolutional operation, and Eq. 4 can be imposed on the feature channels without introducing any additional parameters. Different from [11], in which a semantic mapping matrix is explicitly learned to reduce the redundancy of bilinear features, our strategy focuses on improving the semantics of sub-features and does not modify the backbone network structure.
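As an illustration, Eqs. 3 and 4 can be written in a few lines of PyTorch. This is a minimal sketch under the assumption of $G$ equally sized, contiguous channel groups; it is not the authors' released code:

```python
import torch

def grouping_loss(x, n_groups):
    """Sketch of Eqs. 3-4 on a feature map x of shape (B, C, H, W)."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h * w)
    x = x / (x.norm(dim=-1, keepdim=True) + 1e-8)  # normalize each channel
    d = torch.einsum('biw,bjw->bij', x, x)         # pairwise correlations, Eq. 3
    cg = c // n_groups                             # channels per group (assumed equal)
    D = d.reshape(b, n_groups, cg, n_groups, cg).mean(dim=(2, 4))
    diag = torch.diagonal(D, dim1=1, dim2=2)       # within-group correlations
    # Eq. 4: push cross-group correlations down, pull within-group ones up
    loss = 0.5 * ((D ** 2).sum(dim=(1, 2)) - 2 * (diag ** 2).sum(dim=1))
    return loss.mean()

# e.g. on a last-layer ResNet-18 feature map:
# grouping_loss(torch.randn(2, 512, 14, 14), n_groups=4)
```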

B. Feature Enhancement

Semantic groups can drive features in different groups to be activated by different object parts. However, the discriminability of those parts is not guaranteed. We therefore need to guide these semantic groups to be activated on object parts with strong discriminability. A simple way to achieve this is to match the prediction distributions between the object and its parts, as implemented in [14]. However, it is unclear why matching distributions improves performance. Here, we provide an analysis to understand its principles and make improvements that achieve better performance.

Let $P_w$ and $P_a$ be the prediction distributions of an object and one of its parts, respectively. Matching their distributions amounts to minimizing their KL divergence loss [14],

$$\mathcal{L}_{KL}(P_w \| P_a) = -H(P_w) + H(P_w, P_a), \qquad (5)$$

where $H(P_w) = -\sum P_w \log P_w$ and $H(P_w, P_a) = -\sum P_w \log P_a$. The optimization objective of a classification task can then be generally written as

$$\mathcal{L} = \mathcal{L}_{cr} + \lambda\mathcal{L}_{KL}, \qquad (6)$$


where $\mathcal{L}_{cr}$ is the cross-entropy loss and $\lambda$ is the balance weight. Substituting Eq. 5 into Eq. 6, we have

$$\mathcal{L} = \mathcal{L}_{cr} - \lambda H(P_w) + \lambda H(P_w, P_a). \qquad (7)$$

Eq. 7 implies that matching prediction distributions decomposes into a maximum entropy term and a knowledge distillation term, each of which is a powerful regularizer for model learning [16], [17]. Maximum entropy learning is especially effective for FGIC, as proved in [18]: reducing the confidence of classifiers leads to better generalization in low data-diversity scenarios. The last term in Eq. 7 distills knowledge from the global feature to the local features, driving the model to discover more discriminative parts. We can therefore regulate the importance of the two terms separately for better performance.
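The decomposition in Eq. 7 is easy to verify numerically; the following snippet (illustrative only) checks it on random distributions:

```python
import torch
import torch.nn.functional as F

pw = F.softmax(torch.randn(5), dim=0)    # prediction from the global feature
pa = F.softmax(torch.randn(5), dim=0)    # prediction from a part feature
kl = (pw * (pw.log() - pa.log())).sum()  # KL(Pw || Pa), Eq. 5
h_pw = -(pw * pw.log()).sum()            # entropy H(Pw)
h_cross = -(pw * pa.log()).sum()         # cross entropy H(Pw, Pa)
assert torch.allclose(kl, -h_pw + h_cross)   # Eq. 7 decomposition
```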

Putting it all together, our optimization objective is

$$\mathcal{L} = \mathbb{E}_x\Big(\mathcal{L}_{cr} - \lambda H(P_w) + \frac{\gamma}{G}\sum_{g=1}^{G} H(P_w, P_a^g) + \phi\mathcal{L}_{group}\Big). \qquad (8)$$

Here, $\lambda$, $\gamma$, $\phi$ and $G$ are hyper-parameters, and we omit the dependence on $x$ for clarity.

In our implementation, the last-layer feature channels are first average-pooled and then fed simultaneously into $G+1$ neural networks (NNs) that output class distributions: one takes the whole feature as input, while the others take only the features of the corresponding groups. Only the NN taking the whole feature is used for target prediction (see Fig. 1).
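A sketch of this head structure and of the loss in Eq. 8 (without the grouping term, which is added separately) follows. The class and function names, and the choice to stop gradients through $P_w$ in the distillation term, are our assumptions rather than details from the letter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEFHeads(nn.Module):
    """G+1 single-layer classifiers over average-pooled features (a sketch)."""
    def __init__(self, c=2048, n_groups=4, n_classes=200):
        super().__init__()
        self.g = n_groups
        self.global_fc = nn.Linear(c, n_classes)          # whole feature
        self.group_fc = nn.ModuleList(
            [nn.Linear(c // n_groups, n_classes) for _ in range(n_groups)])

    def forward(self, feat):                              # feat: (B, C, H, W)
        v = feat.mean(dim=(2, 3))                         # global average pooling
        logits = self.global_fc(v)                        # used for prediction
        group_logits = [fc(z) for fc, z in zip(self.group_fc,
                                               v.chunk(self.g, dim=1))]
        return logits, group_logits

def sef_loss(logits, group_logits, target, lam=1.0, gamma=0.05):
    pw = F.softmax(logits, dim=1)
    ce = F.cross_entropy(logits, target)                  # L_cr
    ent = -(pw * pw.clamp_min(1e-8).log()).sum(1).mean()  # H(Pw)
    distill = sum(-(pw.detach() * F.log_softmax(g, dim=1)).sum(1).mean()
                  for g in group_logits) / len(group_logits)  # mean H(Pw, Pa^g)
    return ce - lam * ent + gamma * distill               # + phi * L_group
```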

III. EXPERIMENTS

To demonstrate the effectiveness of the simple construction of our approach for FGIC, we present comparisons to the state-of-the-art methods and ablation studies in this section.

A. Experimental Setup

We employ ResNet-50 [19] as the backbone of our approach in PyTorch and experiment on the CUB-Birds [2], Stanford Cars [3], Stanford Dogs [1], and FGVC-Aircraft [4] datasets (see Table I for data statistics). We initialize our model with weights pretrained on ImageNet [20] and fine-tune all layers on batches of 32 images of size 448×448 by SGD [21] with a momentum of 0.9. Random flipping is employed in training but not in testing. The $G+1$ NNs are all single-layer networks. The initial learning rate, lr, is 0.01, except on Dogs where 0.001 is used; lr decays by 0.1 every 20 epochs over a total of 50 training epochs. $\lambda$, $\gamma$ and $\phi$ are set to 1, 0.05 and 1, respectively, across all datasets, validated through cross-validation on a held-out split of Birds containing 10% of the training samples. $G$ is set to 3 on Aircraft, 4 on Birds, and 2 on the other two datasets.
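For reference, the schedule above corresponds roughly to the following PyTorch setup (a sketch; data loading and the SEF-specific losses are omitted):

```python
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=True)   # ImageNet initialization
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
for epoch in range(50):
    # ... fine-tune on batches of 32 random-flipped 448x448 images ...
    scheduler.step()                # lr *= 0.1 every 20 epochs
```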

Metrics. Besides classification accuracy, we also employ scores to rank the comprehensive performance of a method across datasets. Given the performance of method $m$ on $L$ datasets, the score of method $m$ is $S_m = \frac{1}{L}\sum_{l=1}^{L} s_l^m$, where $s_l^m$ is the rank of method $m$ on the $l$-th dataset based on its classification accuracy. The closer the score is to 1, the better.
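In code, the score can be computed as follows (a sketch; the letter does not state how datasets without a reported result are treated, so they are simply skipped here):

```python
import numpy as np

def method_scores(acc):
    """acc: (M methods x L datasets) accuracies, NaN where unreported."""
    ranks = np.full_like(acc, np.nan)
    for l in range(acc.shape[1]):
        valid = np.flatnonzero(~np.isnan(acc[:, l]))
        order = np.argsort(-acc[valid, l])        # highest accuracy -> rank 1
        ranks[valid[order], l] = np.arange(1, len(valid) + 1)
    return np.nanmean(ranks, axis=1)              # S_m for each method
```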

TABLE I
THE STATISTICS OF FINE-GRAINED DATASETS IN THIS LETTER

Datasets             #category   #training   #testing
CUB-Birds [2]        200         5,994       5,794
Stanford Cars [3]    196         8,144       8,041
Stanford Dogs [1]    120         12,000      8,580
FGVC-Aircraft [4]    100         6,667       3,333

B. Comparison with the State-of-the-Art

For fairness, we only compare to weakly-supervised methods employing the ResNet-50 backbone, due to its popularity and state-of-the-art performance in recent FGIC work. Note that we are not attempting to achieve the best performance but to emphasize the advantage brought by our simple construction.

Complexity analysis. Our approach introduces additional parameters into the backbone network only for knowledge distillation, where the total number of parameters is bounded by the product of the feature dimensionality and the number of classes. For example, 409,600 new parameters are introduced on the Birds dataset, which account for only 1.7% of the parameters of ResNet-50. Since our approach adds only a negligible number of parameters, the network is efficient to train: compared with computation-intensive methods such as S3N [10], TASN [9], API-Net [22], and DCL [23] (requiring 60, 90, 100, and 180 training epochs, respectively), our approach can be optimized in 50 epochs.
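As a check on this count (our arithmetic, stated as an assumption rather than taken from the letter): with the 2048-dimensional ResNet-50 feature split into $G$ groups and 200 bird classes, the $G$ group-wise single-layer classifiers contribute

$$G \times \frac{2048}{G} \times 200 = 2048 \times 200 = 409{,}600$$

weights, which is roughly 1.7% of the ResNet-50 backbone's parameters once its original 1000-way ImageNet classifier (about 2M of its roughly 25.6M parameters) is excluded.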

During testing, only the backbone network is active, with all additional modules removed. Compared with its ResNet-50 backbone, our approach boosts performance by +1.65% on average across all datasets at the same inference-time cost, which indicates the practical value of our approach.

TABLE II
COMPARISON WITH STATE-OF-THE-ART METHODS (%)

Method                 Birds   Cars   Dogs   Aircraft   Scores
Kernel-Pooling? [24]   84.7    92.4   −      86.9       10.3
MAMC-CNN [8]           86.2    93.0   84.8   −          7.7
DFB-CNN? [7]           87.4    93.8   −      92.0       6.3
NTS-Net† [13]          87.5    93.9   −      91.4       6.0
S3N† [10]              88.5    94.7   −      92.8       1.7
API-Net [22]           87.7    94.8   88.3   93.0       2.3
DCL [23]               87.8    94.5   −      93.0       2.7
TASN† [9]              87.9    93.8   −      −          5.0
Cross-X [14]           87.7    94.6   88.9   92.6       2.8

ResNet-50‡ [19]        84.5    92.9   88.1   90.3       8.3
MaxEnt-CNN‡ [18]       83.4    92.8   88.0   90.7       8.8
DBT-Net [11]           87.5    94.1   −      91.2       5.7

SEF (ours)             87.3    94.0   88.8   92.1       4.8

?, † and ‡ denote methods with separate initialization, with multi-cropping operations, and results from our re-implementation, respectively. The closer the score is to 1, the better.

Performance comparison. Table II shows that no single method achieves the best performance on all datasets. Our approach attains a score of 4.8, ranking 5th in overall performance, which is comparable to the state-of-the-art methods, especially considering its simple construction. The methods in the second group are closely related to ours.


Fig. 2. From left to right: correlation matrices of the feature channels of models with 3, 5, and 7 groups, respectively, averaged over 64 images randomly selected from the Birds testing set.

TABLE III
PERFORMANCE OF OUR METHOD ON DIFFERENT OPTIONS (%)

Setting                      Birds   Dogs   Cars   Aircraft
ResNet-18 [19]               82.3    81.5   90.4   88.5
λ = 0,    γ = 0,    φ = 1    82.7    82.1   90.9   88.9
λ = 0.05, γ = 0.05, φ = 0    82.2    81.8   90.3   88.8
λ = 0.05, γ = 0.05, φ = 1    83.2    81.8   90.1   89.0
λ = 1,    γ = 0.05, φ = 1    84.8    83.1   91.8   89.3

Compared to ResNet-50 and MaxEnt-CNN, our approach boosts performance on all datasets, which indicates the effectiveness of our regularization modules. Moreover, our approach achieves more robust performance than DBT-Net. Among the other methods, S3N, API-Net, DCL, and Cross-X rank ahead of ours; they are, however, more expensive than ours (see the complexity analysis above), e.g., Cross-X uses twice as many parameters as ResNet-50.

C. Ablation Studies

Effectiveness of individual modules. We adopt ResNet-18 as the backbone for this analysis since, besides its simplicity, it presents the same trend as ResNet-50. Table III shows the results. It reveals that learning semantic groups or matching distributions alone improves performance only slightly. Combining the two directly brings some advantage owing to the semantics introduced into the feature, but the improvement is not consistent. The performance, however, can be significantly improved by decomposing matching distributions into separate regularizers, which indicates the success of guiding semantic groups to be activated on object parts with strong discriminability. Notice that the hyper-parameters are determined through cross-validation on Birds and applied to all datasets without modification, which signifies the robustness of our approach across datasets.

Number of semantic groups. A group's semantics are strongly related to its discriminability; forcibly separating strongly correlated feature channels into different groups may therefore break the semantics and in turn reduce the discriminability. Fig. 2 and Fig. 3 show, respectively, the correlations between the last-layer feature channels of our approach on the ResNet-18 backbone, and the discriminability of group features for models with varying numbers of groups. They indicate that 1) our semantic grouping module effectively groups correlated feature channels and separates uncorrelated ones; and 2) feature channels should not be divided into too many groups, which on the one hand makes optimization difficult and on the other hand reduces the semantics as well as the discriminability of each group.

Fig. 3. Discriminability of the 1st-group features of models at different learning stages, tested on the Birds validation set. ng denotes the number of groups used in the model.

Fig. 4. Group-wise activation maps of models with 2 semantic groups, superimposed on the original images. The 1st, 2nd, and 3rd rows are, respectively, the original images and the activation maps from the 1st and 2nd semantic groups.

D. Visualization

Fig. 4 shows the group-wise activation maps of a model with 2 semantic groups built on the ResNet-18 backbone. It highlights the object parts that activate features in the corresponding semantic groups. These object parts clearly possess different semantics with strong discriminability, such as the head and belly of the bird, and the wing and engine nacelles of the aircraft.

IV. CONCLUSION

In this letter, we proposed a computationally cheap yet effective approach for FGIC that consists of a semantic grouping module and a feature enhancement module. We empirically studied the effectiveness of each individual module and their combined effects through ablation studies, as well as the relationship between the number of groups and the semantic integrity of each group. Its performance, comparable to the state-of-the-art methods, together with its low computational cost, makes it well suited for wide use in FGIC applications.


REFERENCES

[1] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for fine-grained image categorization,” in First Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR, 2011.

[2] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 dataset,” California Institute of Technology, Tech. Rep., 2011.

[3] J. Krause, M. Stark, J. Deng, and F.-F. Li, “3D object representations for fine-grained categorization,” in 4th IEEE Workshop on 3D Representation and Recognition at ICCV, 2013.

[4] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.

[5] J. Fu, H. Zheng, and T. Mei, “Recurrent attention convolutional neural network for fine-grained image recognition,” in CVPR, 2017.

[6] H. Zheng, J. Fu, T. Mei, and J. Luo, “Learning multi-attention convolutional neural network for fine-grained image recognition,” in ICCV, 2017.

[7] Y. Wang, V. I. Morariu, and L. S. Davis, “Learning a discriminative filter bank within a CNN for fine-grained recognition,” in CVPR, 2018.

[8] M. Sun, Y. Yuan, F. Zhou, and E. Ding, “Multi-attention multi-class constraint for fine-grained image recognition,” in ECCV, 2018.

[9] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, “Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition,” in CVPR, 2019.

[10] Y. Ding, Y. Zhou, Y. Zhu, Q. Ye, and J. Jiao, “Selective sparse sampling for fine-grained image recognition,” in ICCV, 2019.

[11] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, “Learning deep bilinear transformation for fine-grained image representation,” in NeurIPS, 2019.

[12] X. Liu, T. Xia, J. Wang, and Y. Lin, “Fully convolutional attention localization networks,” arXiv preprint arXiv:1603.06765, 2016.

[13] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, “Learning to navigate for fine-grained classification,” in ECCV, 2018.

[14] W. Luo, X. Yang, X. Mo, Y. Lu, L. S. Davis, J. Li, J. Yang, and S.-N. Lim, “Cross-X learning for fine-grained visual categorization,” in ICCV, 2019.

[15] R. Fong and A. Vedaldi, “Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks,” in CVPR, 2018.

[16] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22, no. 1, pp. 39–71, 1996.

[17] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[18] A. Dubey, O. Gupta, R. Raskar, and N. Naik, “Maximum entropy fine-grained classification,” in NeurIPS, 2018.

[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.

[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.

[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[22] P. Zhuang, Y. Wang, and Y. Qiao, “Learning attentive pairwise interaction for fine-grained classification,” in AAAI, 2020.

[23] Y. Chen, Y. Bai, W. Zhang, and T. Mei, “Destruction and construction learning for fine-grained image recognition,” in CVPR, 2019.

[24] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie, “Kernel pooling for convolutional neural networks,” in CVPR, 2017.