
TEX-Nets: Binary Patterns Encoded Convolutional Neural Networks for Texture Recognition

Rao Muhammad Anwer
Department of Computer Science, Aalto University School of Science, Finland
rao.anwer@aalto.fi

Fahad Shahbaz Khan
Computer Vision Laboratory, Linköping University, Sweden
[email protected]

Joost van de Weijer
Computer Vision Centre Barcelona, Universitat Autonoma de Barcelona, Spain
[email protected]

Jorma Laaksonen
Department of Computer Science, Aalto University School of Science, Finland
jorma.laaksonen@aalto.fi

ABSTRACT
Recognizing materials and textures in realistic imaging conditions is a challenging computer vision problem. For many years, local-feature based orderless representations were the dominant approach for texture recognition. Recently, deep local features, extracted from the intermediate layers of a Convolutional Neural Network (CNN), have been used as filter banks. These dense local descriptors from a deep model, when encoded with Fisher Vectors, have been shown to provide excellent results for texture recognition. The CNN models employed in such approaches take RGB patches as input and train on a large amount of labeled images. We show that CNN models, which we call TEX-Nets, trained using mapped coded images with explicit texture information provide complementary information to the standard deep models trained on RGB patches. We further investigate two deep architectures, namely early and late fusion, to combine the texture and color information. Experiments are conducted on four benchmark texture datasets. On all datasets, our results demonstrate that TEX-Nets provide complementary information to the standard RGB deep network. Our approach provides a large gain of 4.8%, 3.5%, 2.6% and 4.1% in accuracy on the DTD, KTH-TIPS-2a, KTH-TIPS-2b and Texture-10 datasets respectively, compared to the standard RGB network of the same architecture. Further, we show that our final combination leads to consistent improvements over the state-of-the-art on all four datasets.

KEYWORDS
Convolutional Neural Networks, Texture Recognition, Local Binary Patterns

ACM Reference format:
Rao Muhammad Anwer, Fahad Shahbaz Khan, Joost van de Weijer, and Jorma Laaksonen. 2017. TEX-Nets: Binary Patterns Encoded Convolutional Neural Networks for Texture Recognition. In Proceedings of ICMR '17, June 6–9, 2017, Bucharest, Romania, 8 pages.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICMR '17, June 6–9, 2017, Bucharest, Romania.
© 2017 ACM. ISBN 978-1-4503-4701-3/17/06...$15.00.
DOI: http://dx.doi.org/10.1145/icmrfp150


1 INTRODUCTION
Texture classification is a challenging problem where the task is to associate each texture image with its respective texture category. Recognizing textures plays a crucial role in many applications, related to biomedical imaging, material recognition, document image analysis, biometrics and retrieval. Over the years, a variety of approaches have been proposed [31, 32, 53] for robust texture description in challenging imaging conditions, including blur, illumination, scale and rotation variations. Most texture description approaches are based on orderless distributions of local features, leading to the development of several successful image classification models, including histograms of vector quantized filter responses [28], texton theory [29], bag-of-words [10] and later the Fisher Vector [40]. In this paper, we investigate the problem of learning robust texture descriptions for texture classification.

One of the most successful texture description approaches is that of Local Binary Patterns (LBP) [38] and its variants. The standard LBP descriptor captures the spatial structure of a texture pattern by describing the pixel neighbourhood through its binary derivatives, which are used to form a short local binary pattern code. The standard LBP descriptor was later extended to obtain multi-scale, rotation invariant and uniform representations. Later works have improved the performance of LBP by either combining its different variants [14] or fusing LBP with other texture descriptors and color [20]. Other than LBP and its variants, bag-of-words based representations employing SIFT features and a Fisher Vector encoding scheme have also shown promising results for texture recognition [8].

Recently, Convolutional Neural Networks (CNNs) have shown significant performance gains in many computer vision applications, including texture recognition [9]. CNNs or deep networks comprise several layers performing convolution and pooling operations, followed by one or more fully connected (FC) layers. The deep networks are trained using raw image pixels (RGB) with a fixed input size and require large amounts of labeled training data (ImageNet [11]). Generally, deep feature based methods either employ end-to-end training or use raw activations from the FC layers as input to linear SVMs. Several recent works [9, 34] have demonstrated promising results by using convolutional layer activations instead of the FC layers. The multi-scale deep local features, extracted from the convolutional layers of CNNs, are then encoded with Fisher Vectors to obtain a scale-invariant image representation. The combination of these Fisher Vector CNNs (FV-CNN) with FC layer activations (FC-CNN) currently holds the state-of-the-art for texture recognition [9].

As mentioned above, the de facto practice when training a CNN is to use the RGB values of the image patch as input to the network. Different to standard approaches, Levi and Hassner [30] propose to train CNNs on pre-processed texture coded images in addition to RGB for emotion recognition. The LBP codes are mapped to a 3D metric space by applying an approximation of the Earth Mover's Distance (EMD), resulting in a 3-channel image. An ensemble of CNN models is trained in their approach using RGB and LBP mapped coded images. A weighted average over the outputs of the network ensemble is then computed to obtain the score of an emotion class. Despite their success for emotion recognition, CNN models trained using texture coded images are yet to be investigated for the task of texture recognition. Further, finding fusion strategies to combine the RGB and explicit texture coded information is an open problem that is yet to be investigated in these networks. In this paper, we investigate the impact of mapped binary patterns encoded deep models for the problem of texture recognition. We evaluate these deep models by constructing both FV-CNN and FC-CNN based deep representations for texture classification.

The aforementioned texture coded mapped images can be combined with RGB in several ways. In the first strategy, termed here late fusion, separate deep models are trained using RGB and texture coded mapped images. The information from the two networks can then be combined by joining them either at the convolutional level or at the fully connected layers. Such a late fusion strategy is also commonly used in action recognition [7, 42] to combine appearance and motion information. In the second strategy, termed here early fusion, a joint deep model is trained by aggregating the RGB and texture image channels as an input to the network. To the best of our knowledge, we are the first to investigate these two fusion approaches, combining texture coded mapped images with RGB, in the context of texture classification.
Contributions: In this work, we investigate how to integrate one of the most popular hand-crafted texture descriptors, Local Binary Patterns (LBP), within deep learning architectures for the texture recognition problem. We train deep models, which we call TEX-Nets, on Local Binary Patterns (LBP) based coded images. First, as in [30], the unordered LBP code values are mapped to points in a 3D metric space by employing Multi-Dimensional Scaling (MDS) using code-to-code dissimilarity scores based on the approximated Earth Mover's Distance (EMD). We then investigate two TEX-Net architectures, early and late fusion, to combine texture and color information. We further evaluate these deep models by constructing representations from the convolutional layers (FV-CNN) and FC layers (FC-CNN) of the deep networks.

Experiments are performed on four challenging texture datasets: DTD, KTH-TIPS-2a, KTH-TIPS-2b and Texture-10. Our results clearly suggest that TEX-Nets are complementary to the standard deep model based on RGB only. The late fusion TEX-Net architecture provides significantly superior performance compared to both the standard RGB and early fusion TEX-Nets, leading to consistent improvements over the state-of-the-art on all datasets.

2 RELATED WORK
Here, we briefly review Local Binary Patterns (LBP) and its variants for texture recognition, together with deep feature learning.
Local Binary Patterns: In the field of texture recognition, local binary patterns (LBP) [38] is one of the most commonly used texture description approaches. Besides texture recognition, LBP based texture description has been applied to other vision tasks, including face recognition [45], object localization [51] and person detection [49]. The LBP descriptor works by thresholding the intensity values of the pixels around a pixel's neighborhood. The threshold is computed from the intensity of each neighborhood's centre pixel. A circular symmetric neighborhood is employed by interpolating the locations not exactly at the center of a pixel. A variety of LBP variants have been proposed in the literature, including Local Ternary Patterns [46], Local Binary Pattern Variance [16], Noise Tolerant Local Binary Patterns [13], Completed Local Binary Patterns [15], Extended Local Binary Patterns [35] and Rotation Invariant Local Phase Quantization [39]. In addition to the introduction of different LBP variants, the fusion of the LBP descriptor with color features has also been investigated in previous studies [20, 37].
Deep Learning: In recent years, Convolutional Neural Networks (CNNs) [26] have been shown to provide excellent performance for many computer vision tasks. CNNs are generally trained using a large amount of labeled training samples and take fixed-size RGB images as input to a series of convolution, normalization and pooling operations (termed layers). The network typically ends with several fully-connected (FC) layers, used to extract features for recognition. Several attempts have been made to improve deep network architectures, including increasing the depth of the network by introducing additional convolutional layers [17, 43]. In addition to RGB based appearance networks, other modalities such as motion and depth have also been used to construct multi-cue deep networks for action recognition [42] and RGB-D object recognition [12].
Deep Learning for Texture Recognition: Recently, deep features have also been investigated for texture recognition. Bruna and Mallat [2] introduce the wavelet convolutional scattering network (ScatNet), where no learning is required and convolutional filters are defined as wavelets. The work of [4] proposes a deep network based on multistage principal component analysis (PCANet). The work of [9] proposes to use the convolutional layers of deep networks as dense local descriptors encoded with Fisher Vectors (FV-CNN) to obtain the final image representation. The employed deep networks in FV-CNN are pre-trained on ImageNet using RGB images as input. The use of convolutional layers allows the images to be of arbitrary resolution and was shown to generalize well to new data, thereby mitigating the need for fine-tuning.
Differences to Our Approach: As discussed above, most existing hand-crafted approaches employ LBP and its variants for texture description. On the other hand, deep learning based approaches extract features from either the FC layer of the network or from the convolutional layers with Fisher Vector encoding. In such cases, the deep networks are pre-trained on ImageNet using RGB images as input. Our approach differs from the aforementioned state-of-the-art texture recognition method [9] in several aspects. Different to RGB based deep network methods [9], our approach is based on learning deep models on Local Binary Patterns (LBP) coded images. Further, as in [30], the unordered LBP code values are mapped to points in a 3D metric space, resulting in a 3-channel mapped coded image. The work of [30] investigates the impact of texture coded deep networks, in an ensemble fashion, for emotion recognition. Different to [30], we investigate learning deep networks with texture coded information for the problem of texture recognition. Additionally, we investigate two network architectures, early and late fusion, to combine the texture information (mapped coded images) with color (RGB). Finally, we construct both Fisher Vector encoded deep convolutional features (FV-CNN) and standard activations from the FC layer (FC-CNN) based image representations, from these models, for texture recognition.

Figure 1: On the left, visualization of filter weights from the RGB model and from the TEX-Net model trained on mapped coded texture information only. On the right, visualization of the activations with highest energy from the conv3 layer of the RGB (top row) and TEX-Net (bottom row) networks on an example texture image. The TEX-Net model is trained on the mapped coded images (visualized here in color), obtained by converting LBP codes into a 3D metric space.

3 BINARY PATTERNS ENCODED CONVOLUTIONAL NEURAL NETWORKS

Here, we describe the construction of binary patterns encoded CNN models. We start by describing the mapped local binary pattern (LBP) codes to be used within the training of the CNN models. We then describe two strategies to fuse the binary patterns encoded images, containing explicit texture information, with RGB. Finally, we present scale-coded deep texture representations with Fisher Vector encoding obtained from our TEX-Net models.

3.1 Mapped LBP Codes
As mentioned earlier, Local Binary Patterns (LBP) is one of the most successful approaches for texture description. The LBP descriptor captures the local gray-scale distribution, obtained by thresholding the intensity values of the pixels in a small block by the intensity value of its centre pixel. The resulting LBP codes are binary numbers (0 if below the threshold, 1 if above). Generally, the LBP codes are computed over an 8-pixel neighborhood, resulting in eight-bit binary numbers between 0 and 255. However, LBP codes can be computed over any neighborhood size or number of sampling points. The final image representation is obtained by computing the histogram of LBP codes over an entire image region, which normalizes for translation. The resulting representation is invariant to monotonic photometric transformations, and further invariance with respect to rotation is achieved through rotation invariant mapping. A minimal sketch of this computation is given below.
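The following numpy sketch illustrates the standard 8-neighbour LBP computation described above; the function name and the random image are illustrative stand-ins, not from the paper.

```python
# Minimal numpy sketch of 8-neighbour LBP codes and their orderless histogram.
import numpy as np

def lbp_codes(img):
    """8-bit LBP code for every interior pixel of a grayscale image."""
    c = img[1:-1, 1:-1]                                  # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),       # 8 neighbours,
               (1, 1), (1, 0), (1, -1), (0, -1)]         # enumerated clockwise
    codes = np.zeros_like(c, dtype=np.uint8)
    h, w = img.shape
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]    # shifted neighbour view
        codes |= (n >= c).astype(np.uint8) << bit        # threshold vs centre
    return codes

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
codes = lbp_codes(img)                                   # values in [0, 255]
hist = np.bincount(codes.ravel(), minlength=256)         # orderless 256-bin histogram
```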

Integrating the strength of the LBP descriptor within deep learning architectures, for texture recognition, is an open research problem. However, despite their success, LBP and its variants are not directly applicable as CNN inputs, due to the unordered nature of the LBP code values. This is due to the convolution operation performed within CNN models. The convolution operation is equivalent to a weighted average of the input values, and is thereby unsuitable for the unordered LBP set. To counter this problem, Levi and Hassner [30] propose to map the LBP codes to points in a 3D metric space using Multi-Dimensional Scaling, for emotion recognition. The resulting transformed LBP points make it possible to perform convolution operations while approximately preserving the original code-to-code distances. When considering code similarity, both the number of differing bit values and their locations need to be taken into account. The work of [30] proposes to use the Earth Mover's Distance (EMD) as a measure of the difference between two LBP codes. The proposed strategy accounts for differences in the spatial locations of pixel codes. We refer to [30] for further details.
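The mapping can be sketched as follows: pairwise dissimilarities between the 256 codes are fed to metric MDS, yielding a 256 × 3 lookup table. The paper relies on the approximated (circular) EMD of [30]; as a simple stand-in, this sketch uses the cumulative-sum form of 1D EMD over the eight bit positions, which reflects both how many bits differ and where they are. This is an assumption for illustration, not the paper's exact distance.

```python
# Hedged sketch: code-to-code dissimilarities + metric MDS -> 3D lookup table.
import numpy as np
from sklearn.manifold import MDS

bits = ((np.arange(256)[:, None] >> np.arange(8)) & 1).astype(float)  # 256 x 8

def bit_emd(a, b):
    # L1 distance between cumulative bit-mass curves; exact 1D EMD on a
    # line when both codes contain the same number of set bits.
    return np.abs(np.cumsum(a - b)).sum()

D = np.array([[bit_emd(bits[i], bits[j]) for j in range(256)]
              for i in range(256)])                 # 256 x 256 dissimilarities
lut = MDS(n_components=3, dissimilarity='precomputed',
          random_state=0).fit_transform(D)          # 256 codes -> points in R^3

codes = np.random.randint(0, 256, (64, 64))         # an LBP-coded image
mapped = lut[codes]                                 # 64 x 64 x 3 mapped coded image
```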

3.2 TEX-Nets Models for Texture Recognition
As discussed earlier, the de facto standard when training CNN models is to use the raw RGB pixel values of an image as input. These RGB based deep networks have recently achieved state-of-the-art results for texture recognition [9]. In this work, we investigate to what extent texture coded deep networks complement the standard RGB based CNN models. To this end, we train CNN models, referred to as TEX-Nets, using the mapped LBP coded images (section 3.1) on the ImageNet ILSVRC-2012 dataset [11]. We employ the VGG-M architecture [5], which is similar to the Zeiler and Fergus network [50]. The VGG-M network consists of five convolutional and three fully-connected layers. The network takes as input an image of 224 × 224 dimensions. Compared to AlexNet, the first convolutional layer employs a smaller stride (2) and receptive field, while the second convolutional layer uses a relatively larger stride (2 compared to 1). The number of convolution filters is 96 in the first convolutional layer, 256 in the second convolutional layer and 512 in the third, fourth and fifth convolutional layers. During training, the learning rate is set to 0.001, the weight decay to 0.0005 and the momentum to 0.9. A sketch of this architecture is given below.
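Below is a hedged PyTorch sketch of a VGG-M-style backbone with the training hyper-parameters stated above. The paper trains with Matconvnet; the filter sizes, strides and paddings here follow the VGG-M of Chatfield et al. [5] and should be read as assumptions, not the paper's verified configuration.

```python
import torch
import torch.nn as nn

def vgg_m(in_channels=3, num_classes=1000):
    # in_channels=3 for RGB or the 3-channel mapped coded input;
    # 6 for the early-fusion variant described in section 3.2.
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(True),
        nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2, ceil_mode=True),
        nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=1), nn.ReLU(True),
        nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2, ceil_mode=True),
        nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(True),
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(True),
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(True),
        nn.MaxPool2d(3, stride=2, ceil_mode=True),
        nn.Flatten(),
        nn.Linear(512 * 6 * 6, 4096), nn.ReLU(True), nn.Dropout(),
        nn.Linear(4096, 4096), nn.ReLU(True), nn.Dropout(),
        nn.Linear(4096, num_classes),
    )

net = vgg_m()
# Training hyper-parameters as stated in the text.
opt = torch.optim.SGD(net.parameters(), lr=0.001,
                      momentum=0.9, weight_decay=0.0005)
out = net(torch.randn(1, 3, 224, 224))   # logits for the 1000 ImageNet classes
```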

Both RGB and mapped LBP coded images are likely to contain complementary information. We therefore further investigate two different strategies to fuse the two sources (color and texture) of information.

Figure 2: Overview of our scale coded deep texture representations. Here we use the TEX-Net model incorporating explicit texture information. The mapped coded images (visualized here in color) are obtained by converting LBP codes (shown as grayscale values) into a 3D metric space.

Late Fusion: In this strategy, both the standard (RGB) and texture coded networks are trained separately on the ImageNet dataset. The standard network takes RGB values as input. For the texture coded network, LBP encoding is applied to each image. The LBP encoding converts the intensity values in an image to one of the 256 LBP code values. The LBP code values are then mapped into a 3D metric space. The resulting 3-channel mapped coded images are then used as input to the CNN models. Despite being efficient to compute, the mapped coded images still introduce a bottleneck if computed on-the-fly. We therefore pre-compute the mapped coded images before training the deep network on ImageNet. Once separately trained, the RGB and texture coded networks are combined by concatenating the activations from the FC layers of the two networks. The stacked FC layer activations from the two networks are then used as features (FC-CNN) which are input to linear SVMs. Generally, activations from the second fully-connected (FC) layer of the network are used in image classification and texture recognition [9]. The two-stream late fusion strategy has previously been used in action recognition to combine spatial (RGB) and temporal (flow) information [7, 42]. Further, late fusion has previously been shown to provide improved performance compared to early fusion when combining multiple cues for object recognition [24, 25], object detection [19], and action recognition [18, 23].
Early Fusion: As an alternative strategy, we also investigate combining RGB and texture coded images in an early fusion manner. In this strategy, the RGB and 3-channel mapped coded texture images are stacked together, resulting in a 6-channel image. As a result, the input to the CNN is an image of 224 × 224 × 6 dimensions. We also investigated converting the 3-channel mapped coded images into a single channel and combining it with the three RGB channels. In both networks, the filters are learned jointly on the RGB and texture coded images. In figure 1 we show the filters learned by training a network using the standard RGB and our mapped coded texture images. Additionally, we show the activations with the highest energy from the conv3 layer of the RGB (top row) and TEX-Net (bottom row) networks on an example texture image. A sketch of both fusion strategies is given below.
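The two strategies can be summarized in a short hedged sketch: late fusion concatenates L2-normalised FC7 activations of the two separately trained networks and trains a one-vs-all linear SVM, while early fusion stacks the two 3-channel inputs into a single 6-channel image. All arrays below are random placeholders for real data.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Late fusion (FC-CNN): N x 4096 activations from each trained network.
rgb_fc7 = np.random.randn(40, 4096)
tex_fc7 = np.random.randn(40, 4096)
labels = np.repeat(np.arange(4), 10)                 # placeholder class labels
fused = np.concatenate([rgb_fc7, tex_fc7], axis=1)   # N x 8192 stacked features
fused /= np.linalg.norm(fused, axis=1, keepdims=True)  # L2-normalise
clf = LinearSVC().fit(fused, labels)                 # one-vs-rest linear SVM

# Early fusion: one 224 x 224 x 6 input per image; the first conv layer
# then learns joint filters over color and texture channels.
rgb = np.random.rand(224, 224, 3)
mapped = np.random.rand(224, 224, 3)                 # from the 3D LBP mapping
six_channel = np.concatenate([rgb, mapped], axis=2)
```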

3.2.1 Scale Coded Deep Texture Representations. Recently, it has been shown that deep convolutional features, extracted at multiple scales from the convolutional layers of CNNs, provide excellent results for texture recognition [9]. These dense deep convolutional features are used with the Fisher Vector encoding scheme (FV-CNN) to obtain the final image representation. It was also shown that FC-CNN and FV-CNN contain complementary information and that their combination improves the overall texture recognition performance [9]. The use of activations from the convolutional layers mitigates the need for fine-tuning, while enabling the input images to be of arbitrary resolution.

Despite extracting deep convolutional features at multiple scales, the final image representation in FV-CNN [9] is scale-invariant, since all local deep features are pooled into a single Fisher Vector representation. Recently, Khan et al. [22] proposed scale coding strategies to explicitly incorporate multi-scale information in the final image representation for human attribute and action recognition. One such strategy is absolute scale coding, in which three scale partitions are constructed for small, medium and large scale features. The final image representation is obtained by concatenating the three scale-partition Fisher encodings. In the work of [22], absolute scale coding was shown to provide superior results compared to FV-CNN. Therefore, in this work, we also investigate absolute scale coded deep texture representations (henceforth referred to as FV-CNN) with our TEX-Net models. In figure 2 we show the different stages of the construction of the absolute scale coding based deep texture representations. A sketch of the partitioning step follows.
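The sketch below illustrates absolute scale coding: local descriptors are split into small/medium/large scale partitions, one Fisher Vector is computed per partition, and the three encodings are concatenated. The partition thresholds and the `fisher_vector` encoder (itself sketched in the experimental setup of section 4) are illustrative assumptions, not the exact choices of [22].

```python
import numpy as np

def scale_coded_fv(descriptors_per_scale, scales, fisher_vector, gmm):
    """Absolute scale coding over per-scale local descriptor arrays."""
    partitions = {'small': [], 'medium': [], 'large': []}
    for s, desc in zip(scales, descriptors_per_scale):
        if s < 0.8:                      # assumed partition boundaries
            partitions['small'].append(desc)
        elif s < 1.1:
            partitions['medium'].append(desc)
        else:
            partitions['large'].append(desc)
    # One Fisher Vector per scale partition; the concatenation keeps the
    # multi-scale information that a single pooled Fisher Vector discards.
    return np.concatenate([fisher_vector(np.vstack(p), gmm)
                           for p in partitions.values()])
```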

4 EXPERIMENTS
Here, we present the results of our experiments. We first provide a baseline comparison of the texture coded CNN models with the standard RGB based deep network. We then provide a comparison of our approach with state-of-the-art results reported in the literature.
Datasets: We perform experiments on four challenging texture datasets: DTD, KTH-TIPS-2a, KTH-TIPS-2b and Texture-10. The DTD dataset consists of 5640 images from 47 texture classes, collected from the web. Each texture class consists of 120 images, with the dataset equally divided into training, validation and test sets. The training and test splits are provided by the authors. The KTH-TIPS-2a dataset consists of 11 texture classes. Its 4752 images are captured at 9 different scales, 3 poses and 4 different illumination conditions. Similar to previous works [3, 6, 41], average classification performance is reported over the 4 test runs. In each run, the images from 1 sample are used for testing while the images from the remaining 3 samples are used for training. The KTH-TIPS-2b dataset consists of 11 texture categories. Here, all images from 1 sample are used for training while all the images from the remaining 3 samples are used for testing in each test run. Lastly, the Texture-10 dataset consists of 400 images of 10 different texture categories. In figure 3 we show example images from the four texture datasets used in our experiments.

Figure 3: Example images from the four texture datasets: DTD, KTH-TIPS-2a, KTH-TIPS-2b and Texture-10.

Experimental Setup: As mentioned earlier, we employ the VGG-M architecture and train the TEX-Net models, from scratch, using the Matconvnet library. All the networks are trained on the ImageNet 2012 training set. The FC-CNN features are extracted from the FC7 (second to last) layer of the network and are of 4096 dimensions. The FC-CNN features are L2-normalised and input to a linear SVM classifier. To construct the FV-CNN representations (section 3.2.1), we extract convolutional features from the output of the last convolutional layer of the networks. The convolutional features are extracted after rescaling the image at 9 different scales s ∈ {0.5 + 0.1n | n = 0, 1, …, 8}. The resulting dense local features are of 512 dimensions. A visual vocabulary of 32 components is constructed using a Gaussian Mixture Model (GMM) with diagonal covariances. The GMM parameters are fit using the extracted dense local descriptors sampled over all scales of the training data. Finally, the local descriptors from each scale partition are encoded with Fisher Vectors using the GMM vocabulary. The final image representation is obtained by concatenating the Fisher Vector representations of the three scale partitions. We employ the VLFeat library to construct the GMM and the Fisher Vector encoding. A Python sketch of this encoding pipeline is given after the evaluation criteria below.
Evaluation Criteria: In all cases, the results are reported as the average classification accuracy over all texture classes in a dataset. We employ one-versus-all SVMs with a linear kernel. The category label from the classifier providing the highest confidence is assigned to the test instance. The overall classification results are obtained by calculating the average recognition score over all texture categories.
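To make the setup concrete, the following is a minimal sketch of the FV pipeline with the stated parameters (9 scales, a 32-component diagonal GMM, 512-dimensional descriptors). The paper uses VLFeat's implementation; scikit-learn serves as a stand-in here, and the random descriptors are placeholders for real conv5 activations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

scales = [0.5 + 0.1 * n for n in range(9)]           # s in {0.5, ..., 1.3}

def fisher_vector(X, gmm):
    """Improved Fisher Vector of local descriptors X (n x d)."""
    gamma = gmm.predict_proba(X)                     # n x K soft assignments
    n = X.shape[0]
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    parts = []
    for k in range(len(w)):
        diff = (X - mu[k]) / np.sqrt(var[k])          # whitened residuals
        g = gamma[:, k:k + 1]
        parts.append((g * diff).sum(0) / (n * np.sqrt(w[k])))                 # mean part
        parts.append((g * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w[k])))  # variance part
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalisation
    return fv / np.linalg.norm(fv)                   # L2 normalisation

# Fit the 32-word vocabulary on descriptors pooled over all scales.
train_desc = np.random.randn(5000, 512)
gmm = GaussianMixture(n_components=32, covariance_type='diag').fit(train_desc)
fv = fisher_vector(np.random.randn(800, 512), gmm)   # one image's encoding
```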

4.1 Baseline Comparison
We first provide a baseline comparison for both the FC-CNN and FV-CNN based deep representations. We compare our models with the standard RGB based CNN approach and investigate whether the TEX-Net models contain complementary information to the standard RGB network. We also investigate both the early and late fusion schemes for combining texture and RGB information. For a fair comparison, we do not fine-tune any CNN model and use the same VGG-M network architecture together with the same set of parameters for all CNN models. Table 1 shows the baseline comparison on the four texture datasets. On the DTD dataset, the FC-CNN representation from the standard RGB network obtains a mean accuracy of 63.4%. Our FC-CNN representation based on the early fusion TEX-Net-EF-6ch model obtains a 64.0% mean accuracy. The FC-CNN representation based on the early fusion TEX-Net-EF-4ch model obtains 64.6% mean accuracy. The FC-CNN representation based on the TEX-Net standard model provides a classification score of 55.9%. The best results are obtained when combining the FC-CNN representations from the standard RGB and mapped coded texture networks in a late fusion manner (TEX-Net-LF). This FC-CNN representation from the late fusion TEX-Net-LF model obtains a 68.2% mean accuracy, achieving a significant gain of 4.8% compared to the standard RGB network. Similarly, the FV-CNN representation from the standard RGB network obtains an average accuracy of 66.2%. Our late fusion based FV-CNN from the TEX-Net-LF model achieves the best results with a mean recognition rate of 71.1%.

On the KTH-TIPS-2a dataset, the FC-CNN representation from the standard RGB network obtains a mean classification score of 81.8%. Among the FC-CNN representations from the different TEX-Net models, the best results are obtained when combining the FC-CNN representations from the standard RGB and mapped coded texture networks in a late fusion manner (TEX-Net-LF).


                  |        DTD          |    KTH-TIPS-2a      |    KTH-TIPS-2b      |  Texture-10
                  | FC-CNN    FV-CNN    | FC-CNN    FV-CNN    | FC-CNN    FV-CNN    | FC-CNN  FV-CNN
Standard RGB      | 63.4 ±0.7 66.2 ±1.2 | 81.8 ±5.1 82.2 ±5.6 | 72.9 ±2.1 74.1 ±3.1 | 87.3    87.5
TEX-Net Standard  | 55.9 ±1.1 59.6 ±1.2 | 68.6 ±5.3 69.5 ±5.8 | 60.2 ±2.9 64.3 ±4.1 | 81.7    81.8
TEX-Net-EF-6ch    | 64.0 ±0.8 66.8 ±1.4 | 82.6 ±5.5 83.1 ±5.5 | 73.6 ±2.6 76.1 ±3.8 | 89.1    89.0
TEX-Net-EF-4ch    | 64.6 ±0.9 67.5 ±1.1 | 83.4 ±5.3 83.9 ±5.8 | 73.8 ±2.7 76.2 ±3.6 | 89.3    89.2
TEX-Net-LF        | 68.2 ±0.8 71.1 ±1.2 | 85.3 ±5.6 86.1 ±5.7 | 75.5 ±2.7 77.3 ±3.8 | 91.3    91.4

Table 1: Baseline comparison (in %) of our approaches with the standard RGB network. We show a comparison with our different TEX-Net models: the model based on only texture coded images (TEX-Net Standard), the early fusion architecture combining the RGB and 3 mapped coded channels (TEX-Net-EF-6ch), the early fusion architecture combining the RGB and a single mapped coded channel (TEX-Net-EF-4ch), and the late fusion architecture (TEX-Net-LF) combining the standard RGB and TEX-Net standard networks. In all cases, our representations from TEX-Net-LF outperform the standard RGB network based representations.

               | DTD   | KTH-TIPS-2a | KTH-TIPS-2b | Texture-10
MRELBP [33]    | 44.9  | -           | 69.0        | -
CLBP [15]      | 42.6  | -           | 64.2        | 64.3
ELBP [36]      | 39.9  | -           | 64.8        | 73.8
CLBPHF [55]    | 50.2  | -           | 68.1        | 76.1
LHS [41]       | -     | 73.0        | -           | -
WLD [6]        | -     | 56.4        | -           | -
CNLBP [21]     | -     | -           | -           | 77.0
PM [21]        | -     | -           | -           | 73.0
MWLD [6]       | -     | 64.7        | -           | -
LQP [48]       | -     | 64.2        | -           | -
LTP [46]       | -     | 60.0        | -           | -
CMR [54]       | -     | 69.4        | -           | -
LVCBP [27]     | -     | 61.7        | 53.6        | 58.7
LEP [52]       | 38.7  | -           | 63.1        | -
LBP-HF [1]     | -     | -           | 54.6        | -
CCTD [20]      | -     | 82.7        | 70.6        | 82.0
TFT [47]       | -     | -           | 66.3        | -
SM [44]        | -     | 88.2        | 76.0        | -
FV-CNN [9]     | 75.5  | -           | 81.5        | -
This paper     | 76.3  | 88.3        | 82.4        | 96.9

Table 2: Comparison (in %) of our final representation with the state-of-the-art. Our approach provides consistent improvement over the state-of-the-art on all four datasets.

In case of the FV-CNN representation from the standard RGB network, a mean accuracy of 82.2% is obtained. The late fusion based FV-CNN from the standard RGB and mapped coded texture networks (TEX-Net-LF) achieves the best results with a mean recognition rate of 86.1%. On the KTH-TIPS-2b dataset, a mean classification accuracy of 72.9% is achieved when using the FC-CNN representation from the standard RGB network. The two early fusion based TEX-Net models provide inferior performance compared to the late fusion based TEX-Net FC-CNN representation. In case of the FV-CNN representation, the best results are again obtained with the late fusion TEX-Net model. Finally, on the Texture-10 dataset, both the FC-CNN and FV-CNN representations from the late fusion TEX-Net model (TEX-Net-LF) provide the best results and outperform the standard RGB network.

In summary, both the early and late fusion based TEX-Net models improve the performance compared to only using the standard RGB network. For both FC-CNN and FV-CNN, the best results are obtained from the late fusion of the explicit color and texture CNN models. A large improvement of 4.8%, 3.5%, 2.6% and 4.1% respectively is achieved by the late fusion of deep color and texture representations on the DTD, KTH-TIPS-2a, KTH-TIPS-2b and Texture-10 datasets, compared to the baseline.

4.2 Comparison with the State-of-the-art
As discussed above, the late fusion of deep color and texture representations provides superior performance. Here, we compare our approach with state-of-the-art methods in the literature. Recently, the work of [9] proposed to combine the FC-CNN and FV-CNN representations from the standard RGB very deep network [43]. For a fair comparison, we also employ the very deep RGB network (VGG-16) and combine it with our mapped coded TEX-Net in a late fusion fashion. Table 2 provides a comparison of our final representation with state-of-the-art methods on the four datasets. On the DTD dataset, FV-CNN [9] provides a mean accuracy of 75.5%. Our approach improves over the state-of-the-art on this dataset by achieving a mean recognition rate of 76.3%. On the KTH-TIPS-2a dataset, the local higher-order statistics based approach of [41] achieves a classification accuracy of 73.0%. The local color vector binary patterns based method [27] obtains a recognition rate of 61.7%. Our approach significantly improves the state-of-the-art by achieving a classification score of 88.3%.

On the KTH-TIPS-2b dataset, the extended LBP descriptor based approach [36] achieves an accuracy of 58.1%. The Fisher Vector based very deep representation (FV-CNN) [9] obtains an accuracy of 81.5%. The approach combining LBP and Fourier features [1] achieves an accuracy of 54.6%. The compact color and texture description method [20], combining color names and heterogeneous texture representations, achieves a classification score of 70.6%. Our approach improves the state-of-the-art and achieves a classification score of 82.4%. Finally, on the Texture-10 dataset, the compact color and heterogeneous texture representation approach [20] provides a recognition accuracy of 82.0%. Our approach outperforms the CCTD approach [20] by achieving a classification score of 96.9%.

5 CONCLUSIONS
This paper investigated the integration of the popular hand-crafted texture descriptor, Local Binary Patterns (LBP), within deep learning architectures for the texture recognition problem. We trained deep models on mapped coded images obtained by converting LBP codes into a 3D metric space. We also investigated two TEX-Net architectures, early and late fusion, to combine texture coded mapped images with RGB. Experiments on four challenging texture datasets clearly demonstrate that the late fusion architecture, fusing features from the standard RGB and mapped coded texture deep networks, significantly outperforms the standard RGB network. One interesting future direction will be to investigate fusion strategies at various intermediate layers of the two networks.
Acknowledgments: This work has been funded by grant 251170 of the Academy of Finland, projects TIN2013-41751-P and TIN2016-79717-R of the Spanish Ministry of Economy, Industry and Competitiveness, SSF through a grant for the project SymbiCloud, the VR starting grant (2016-05543), and the Strategic Area for ICT research ELLIIT. The calculations were performed using computer resources within the Aalto University School of Science "Science-IT" project and NSC. We also acknowledge the support from Nvidia.

REFERENCES
[1] Timo Ahonen, Jiri Matas, Chu He, and Matti Pietikainen. 2009. Rotation Invariant Image Description with Local Binary Pattern Histogram Fourier Features. In SCIA.
[2] Joan Bruna and Stephane Mallat. 2013. Invariant Scattering Convolution Networks. PAMI 35, 8 (2013), 1872–1886.
[3] Barbara Caputo, Eric Hayman, and P Mallikarjuna. 2005. Class-Specific Material Categorisation. In ICCV.
[4] Tsung-Han Chan, Kui Jia, Shenghua Gao, and Yi Ma. 2014. PCANet: A Simple Deep Learning Baseline for Image Classification? TIP 24, 12 (2014), 5017–5032.
[5] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the Devil in the Details: Delving Deep into Convolutional Nets. In BMVC.
[6] Jie Chen, Shiguang Shan, Chu He, Guoying Zhao, Matti Pietikainen, Xilin Chen, and Wen Gao. 2010. WLD: A Robust Local Image Descriptor. PAMI 32, 9 (2010), 1705–1720.
[7] Guilhem Cheron, Ivan Laptev, and Cordelia Schmid. 2015. P-CNN: Pose-Based CNN Features for Action Recognition. In ICCV.
[8] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. 2014. Describing Textures in the Wild. In CVPR.
[9] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, and Andrea Vedaldi. 2016. Deep Filter Banks for Texture Recognition, Description, and Segmentation. IJCV 118, 1 (2016), 65–94.
[10] G. Csurka, C. Bray, C. Dance, and L. Fan. 2004. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
[12] Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard. 2015. Multimodal Deep Learning for Robust RGB-D Object Recognition. In IROS.
[13] Abdolhossein Fathi and Ahmad Nilchi. 2012. Noise tolerant local binary pattern operator for efficient texture analysis. PRL 33, 9 (2012), 1093–1100.
[14] Yimo Guo, Guoying Zhao, and Matti Pietikainen. 2012. Discriminative features for texture description. PR 45, 10 (2012), 3834–3843.
[15] Zhenhua Guo, Lei Zhang, and David Zhang. 2010. A Completed Modeling of Local Binary Pattern Operator for Texture Classification. TIP 19, 6 (2010), 1657–1663.
[16] Zhenhua Guo, Lei Zhang, and David Zhang. 2010. Rotation invariant texture classification using LBP variance (LBPV) with global matching. PR 43, 3 (2010), 706–719.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.
[18] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew Bagdanov, Antonio Lopez, and Michael Felsberg. 2013. Coloring Action Recognition in Still Images. IJCV 105, 3 (2013), 205–221.
[19] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew D. Bagdanov, Maria Vanrell, and Antonio M. Lopez. 2012. Color attributes for object detection. In CVPR.
[20] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Michael Felsberg, and Jorma Laaksonen. 2015. Compact color-texture description for texture classification. PRL 51 (2015), 16–22.
[21] Fahad Shahbaz Khan, Joost van de Weijer, Sadiq Ali, and Michael Felsberg. 2013. Evaluating the Impact of Color on Texture Recognition. In CAIP.
[22] Fahad Shahbaz Khan, Joost van de Weijer, Rao Muhammad Anwer, Andrew Bagdanov, Michael Felsberg, and Jorma Laaksonen. 2016. Scale Coding Bag of Deep Features for Human Attribute and Action Recognition. arXiv preprint arXiv:1612.04884 (2016).
[23] Fahad Shahbaz Khan, Joost van de Weijer, Rao Muhammad Anwer, Michael Felsberg, and Carlo Gatta. 2014. Semantic Pyramids for Gender and Action Recognition. TIP 23, 8 (2014), 3633–3645.
[24] Fahad Shahbaz Khan, Joost van de Weijer, and Maria Vanrell. 2009. Top-Down Color Attention for Object Recognition. In ICCV.
[25] Fahad Shahbaz Khan, Joost van de Weijer, and Maria Vanrell. 2012. Modulating Shape Features by Color Attention for Object Recognition. IJCV 98, 1 (2012), 49–64.
[26] Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, R Howard, Wayne Hubbard, and Lawrence Jackel. 1989. Handwritten Digit Recognition with a Back-Propagation Network. In NIPS.
[27] Seung Ho Lee, Jae Young Choi, Yong Man Ro, and Konstantinos Plataniotis. 2012. Local Color Vector Binary Patterns From Multichannel Face Images for Face Recognition. TIP 21, 4 (2012), 2347–2353.
[28] Thomas Leung and Jitendra Malik. 1996. Detecting, localizing and grouping repeated scene elements from an image. In ECCV.
[29] Thomas Leung and Jitendra Malik. 2001. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons. IJCV 43, 1 (2001), 29–44.
[30] Gil Levi and Tal Hassner. 2015. Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns. In ICMI.
[31] Li Liu, Paul Fieguth, Yulan Guo, Xiaogang Wang, and Matti Pietikainen. 2017. Local binary features for texture classification: Taxonomy and experimental study. PR 62 (2017), 135–160.
[32] Li Liu, Paul Fieguth, Xiaogang Wang, Matti Pietikainen, and Dewen Hu. 2016. Evaluation of LBP and Deep Texture Descriptors with a New Robustness Benchmark. In ECCV.
[33] Li Liu, Songyang Lao, Paul Fieguth, Yulan Guo, Xiaogang Wang, and Matti Pietikainen. 2016. Median Robust Extended Local Binary Pattern for Texture Classification. TIP 25, 3 (2016), 1368–1381.
[34] Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2015. The Treasure beneath Convolutional Layers: Cross-convolutional-layer Pooling for Image Classification. In CVPR.
[35] Li Liu, Lingjun Zhao, Yunli Long, and Paul Fieguth. 2012. Extended local binary patterns for texture classification. IMAVIS 30, 2 (2012), 86–99.
[36] Li Liu, Lingjun Zhao, Yunli Long, Gangyao Kuang, and Paul Fieguth. 2012. Extended local binary patterns for texture classification. IVC 30, 2 (2012), 86–99.
[37] Topi Maenpaa and Matti Pietikainen. 2004. Classification with color and texture: jointly or separately? PR 37, 8 (2004), 1629–1640.
[38] Timo Ojala, Matti Pietikainen, and Topi Maenpaa. 2002. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. PAMI 24, 7 (2002), 971–987.
[39] Ville Ojansivu, Esa Rahtu, and Janne Heikkila. 2009. Rotation Invariant Local Phase Quantization for Blur Insensitive Texture Analysis. In ICPR.
[40] Florent Perronnin and Christopher Dance. 2007. Fisher Kernels on Visual Vocabularies for Image Categorization. In CVPR.
[41] Gaurav Sharma, Sibt ul Hussain, and Frederic Jurie. 2012. Local Higher-Order Statistics (LHS) for Texture Categorization and Facial Analysis. In ECCV.
[42] Karen Simonyan and Andrew Zisserman. 2014. Two-Stream Convolutional Networks for Action Recognition in Videos. In NIPS.
[43] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
[44] Milan Sulc and Jiri Matas. 2014. Fast Features Invariant to Rotation and Scale of Texture. In ECCV Workshops.
[45] Xiaoyang Tan and Bill Triggs. 2007. Fusing Gabor and LBP Feature Sets for Kernel-Based Face Recognition. In AMFG.
[46] Xiaoyang Tan and Bill Triggs. 2010. Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions. TIP 19, 9 (2010), 1635–1650.
[47] Radu Timofte and Luc Van Gool. 2012. A Training-free Classification Framework for Textures, Writers, and Materials. In BMVC.
[48] Sibt ul Hussain and Bill Triggs. 2012. Visual Recognition Using Local Quantized Patterns. In ECCV.
[49] Xiaoyu Wang, Tony Han, and Shuicheng Yan. 2009. An HOG-LBP Human Detector with Partial Occlusion Handling. In ICCV.
[50] Matthew Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In ECCV.
[51] Junge Zhang, Kaiqi Huang, Yinan Yu, and Tieniu Tan. 2011. Boosted local structured HOG-LBP for object localization. In CVPR.
[52] Jun Zhang, Jimin Liang, and Heng Zhao. 2013. Local Energy Pattern for Texture Classification Using Self-Adaptive Quantization Thresholds. TIP 22, 1 (2013), 31–42.
[53] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. 2007. Local features and kernels for classification of texture and object categories: A Comprehensive Study. IJCV 73, 2 (2007), 213–218.
[54] Jun Zhang, Heng Zhao, and Jimin Liang. 2013. Continuous rotation invariant local descriptors for texton dictionary-based texture classification. CVIU 117, 1 (2013), 56–75.
[55] Guoying Zhao, Timo Ahonen, Jiri Matas, and Matti Pietikainen. 2012. Rotation-Invariant Image and Video Description With Local Binary Pattern Features. TIP 21, 4 (2012), 1465–1477.